Abstract

There is a growing body of learning problems for which it is natural to organize the parameters
into a matrix, so as to appropriately regularize the parameters under some matrix norm (in order to
impose some more sophisticated prior knowledge). This work describes and analyzes a systematic
method for constructing such matrix-based regularization methods. In particular, we focus on how
the underlying statistical properties of a given problem can help us decide which regularization
function is appropriate.

Our methodology is based on the known duality fact: that a function is strongly convex with respect 
to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. 
This result has already been found to be a key component in deriving and analyzing several learning 
algorithms. We demonstrate the potential of this framework by deriving novel generalization and 
regret bounds for multi-task learning, multi-class learning, and kernel learning. 

1 Introduction 

As we tackle more challenging learning problems, there is an increasing need for algorithms that efficiently 
impose more sophisticated forms of prior knowledge. Examples include: the group Lasso problem (for 
"shared" feature selection across problems), kernel learning, multi-class prediction, and multi-task learning. 
A central question here is to understand the performance of such algorithms in terms of the attendant com- 
plexity restrictions imposed by the algorithm. Such analyses often illuminate the nature in which our prior 
knowledge is being imposed. 

The predominant modern method for imposing complexity restrictions is through regularizing a vector
of parameters, and much work has gone into understanding the relationship between the nature of the
regularization and the implicit prior knowledge imposed, particularly for the case of regularization with $\ell_2$ and
$\ell_1$ norms (where one is more tailored to rotational invariance and margins, while the other is more suited to
sparsity). When dealing with more complex problems, we need systematic tools for designing more complicated
regularization schemes. This work examines regularization based on group norms and spectral norms
of matrices. We analyze the performance of such regularization methods and provide a methodology for
choosing a regularization function based on the underlying statistical properties of a given problem.

In particular, we utilize a recently developed methodology, based on the notion of strong convexity, for 
designing and analyzing the regret or generalization ability of a wide range of learning algorithms (see e.g. 
Shalev-Shwartz [2007], Kakade et al. [2008]). In fact, most of our efficient algorithms (both in the batch and 
online settings) impose some complexity control via the use of some strictly convex penalty function either 
explicitly via a regularizer or implicitly in the design of an online update rule. Central to understanding these 
algorithms is the manner in which these penalty functions are strictly convex, i.e. the behavior of the "gap" 
by which these convex functions lie above their tangent planes, which is strictly positive for strictly convex 
functions. Here, the notion of strong convexity provides one means to characterize this gap in terms of some 
general norm rather than just Euclidean. 

The importance of strong convexity can be understood using the duality between strong convexity and 
strong smoothness. Strong smoothness measures how well a function is approximated at some point by its 
linearization. Linear functions are easy to manipulate (e.g. because of the linearity of expectation). Hence, 
if a function is sufficiently smooth we can more easily control its behavior. We further distill the analysis 
given in Shalev-Shwartz [2007], Kakade et al. [2008] — based on the strong-convexity/smoothness duality, 
we derive a key inequality which seamlessly enables us to design and analyze a family of learning algorithms. 



Our focus in this work is on learning with matrices. We characterize a number of matrix based regular- 
ization functions, of recent interest, as being strongly convex functions — allowing us to immediately derive 
learning algorithms by relying on the family of learning algorithms mentioned previously. Specifying the 
general performance bounds for the specific matrix based regularization method, we are able to systemati- 
cally decide which regularization function is more appropriate based on underlying statistical properties of a 
given problem. 

1.1 Our Contributions 

We can summarize the contributions of this work as follows: 

• We show how the framework based on strong convexity/strong smoothness duality (see Shalev-Shwartz
[2007], Kakade et al. [2008]) provides a methodology for analyzing matrix based learning methods, 
which are of much recent interest. These results reinforce the usefulness of this framework in providing 
both learning algorithms, and their associated complexity analysis. For this reason, we further distill the 
analysis given in Shalev-Shwartz [2007], Kakade et al. [2008] by emphasizing a key inequality which 
immediately enables us to design and analyze a family of learning algorithms. 

• We provide template algorithms (both in the online and batch settings) for a number of machine learning 
problems of recent interest, which use matrix parameters. In particular, we provide a simple derivation 
of generalization/mistake bounds for: (i) online and batch multi-task learning using group or spectral 
norms, (ii) online multi-class categorization using group or spectral norms, and (iii) kernel learning. 

• Based on the derived bounds, we interpret how statistical properties of a given problem can help us 
decide which regularization function is appropriate. For example, for the case of multi-class learning, 
we describe and analyze a new "group Perceptron" algorithm and show that with a shared structure 
between classes, this algorithm significantly outperforms previously proposed algorithms. Similarly, for 
the case of multi-task learning, the pressing question is what shared structure between the tasks allows 
for sample complexity improvements and by how much? We discuss these issues based on our regret 
and generalization bounds. 

• Our unified analysis significantly simplifies previous analyses of recently proposed algorithms. For 
example, the generality of this framework allows us to simplify the proofs of previously proposed regret 
bounds for online multi-task learning (e.g. Cavallanti et al. [2008], Agarwal et al. [2008]). Furthermore, 
bounds that follow immediately from our analysis are sometimes much sharper than previous results 
(e.g. we improve the bounds for multiple kernel learning given in Lanckriet et al. [2004], Srebro and 
Ben-David [2006]). 

1.2 Related work 

We first discuss related work on learning with matrix parameters, then discuss the use of strong convexity in
learning. 

Matrix Learning: There is a growing body of work studying learning problems in which the parameters can
be organized as matrices. Several examples are multi-class categorization (e.g. Crammer and Singer [2000]), 
multi-task and multi-view learning (e.g. Cavallanti et al. [2008], Agarwal et al. [2008]), and online PCA 
[Warmuth and Kuzmin, 2006]. It was also studied under the framework of group Lasso (e.g. Yuan and Lin 
[2006], Obozinski et al. [2007], Bach [2008]). 

In the context of learning vectors (rather than matrices), the study of the relative performance of different 
regularization techniques based on properties of a given task dates back to Littlestone [1988], Kivinen and 
Warmuth [1997]. In the context of batch learning, it was studied by several authors (e.g. Ng [2004]). 

We also note that much of the work on multi-task learning for regression is on union support recovery — 
a setting where the generative model specifies a certain set of relevant features (over all the tasks), and the 
analysis here focuses on the conditions and sample sizes under which the union of the relevant features can 
be correctly identified (e.g. Obozinski et al. [2007], Lounici et al. [2009]). Essentially, this is a generalization 
of the issue of identifying the relevant feature set in the standard single task regression setting, under $\ell_1$
regression. In contrast, our work focuses on the agnostic setting of just understanding the sample size needed 
to obtain a given error rate (rather than identifying the relevant features themselves). 

We also discuss related work on kernel learning in Section 6. Our analysis here utilizes the equivalence 
between kernel learning and group Lasso (as noted in Bach [2008]). 

Strong Convexity/Strong Smoothness: The notion of strong convexity takes its roots in optimization. 
Zalinescu [2002] attributes it to a paper of Polyak in the 1960s. Relatively recently, its use in machine
learning has been twofold: in deriving regret bounds for online algorithms and generalization bounds in
batch settings.

The duality of strong convexity and strong smoothness was first used by Shalev-Shwartz and Singer 
[2006], Shalev-Shwartz [2007] in the context of deriving low regret online algorithms. Here, once we choose 
a particular strongly convex penalty function, we immediately have a family of algorithms along with a regret 
bound for these algorithms that is in terms of a certain strong convexity parameter. A variety of algorithms 
(and regret bounds) can be seen as special cases. 

A similar technique, in which the Hessian is directly bounded, is described by Grove et al. [2001],
Shalev-Shwartz and Singer [2007]. Another related approach involved bounding a Bregman divergence [Kivinen and
Warmuth, 1997, 2001, Gentile, 2003] (see Cesa-Bianchi and Lugosi [2006] for a detailed survey). Another 
interesting application of the very same duality is for deriving and analyzing boosting algorithms [Shalev- 
Shwartz and Singer, 2008]. 

More recently, Kakade et al. [2008] showed how to use the very same duality for bounding the 
Rademacher complexity of classes of linear predictors. That the Rademacher complexity is closely related to 
Fenchel duality was shown in Meir and Zhang [2003], and the work in Kakade et al. [2008] made the further 
connection to strong convexity. Again, under this characterization, a number of generalization and margin 
bounds (for methods which use linear prediction) are immediate corollaries, as one only needs to specify the 
strong convexity parameter from which these bounds easily follow (see Kakade et al. [2008] for details). 

The concept of strong smoothness (essentially a second order upper bound on a function) has also
been in play in a different literature, for the analysis of the concentration of martingales in smooth Banach
spaces [Pinelis, 1994, Pisier, 1975]. This body of work seeks to understand the concentration properties of a
random variable $\|X_t\|$, where $X_t$ is a (vector valued) martingale and $\|\cdot\|$ is a smooth norm, say an $L_p$-norm.

Recently, Juditsky and Nemirovski [2008] used the fact that a norm is strongly convex if and only if
its conjugate is strongly smooth. This duality was useful in deriving concentration properties of a random
variable $\|M\|$, where now $M$ is a random matrix. The norms considered here were the (Schatten) $L_p$-matrix
norms and certain "block" composed norms (such as the $\|\cdot\|_{2,q}$ norm).

1.3 Organization 

The rest of the paper is organized as follows. In Section 2, we describe the general family of learning 
algorithms. In particular, after presenting the duality of strong-convexity/strong-smoothness, we isolate an 
important inequality (Corollary 4) and show that this inequality alone seamlessly yields regret bounds in the
online learning model and Rademacher bounds (that lead to generalization bounds in the batch learning
model). We further highlight the importance of strong convexity to matrix learning applications by drawing 
attention to families of strongly convex functions over matrices. To do so, we rely on the recent results of 
Juditsky and Nemirovski [2008]. In particular, we obtain a strongly convex function over matrices based 
on strongly convex vector functions, which leads to a number of corollaries relevant to problems of recent 
interest. Next, in Section 3 we show how the obtained bounds can be used for systematically choosing
adequate prior knowledge (i.e. regularization) based on properties of the given task. We then turn to describe
the applicability of our approach to more complex prediction problems. In particular, we study multi-task 
learning (Section 4), multi-class categorization (Section 5), and kernel learning (Section 6). Naturally, many 
of the algorithms we derive have been proposed before. Nevertheless, our unified analysis enables us to 
simplify previous analyses, understand the merits and pitfalls of different schemes, and even derive new
algorithms/analyses. 

2 Preliminaries and Techniques 

In this section we describe the necessary background. Most of the results below are not new and are based 
on results from Shalev-Shwartz [2007], Kakade et al. [2008], Juditsky and Nemirovski [2008]. Nevertheless, 
we believe that the presentation given here is simpler and slightly more general. 

Our results are based on basic notions from convex analysis and matrix computation. The reader not 
familiar with some of the objects described below may find short explanations in Appendix A. 

2.1 Notation 

We consider convex functions $f : \mathcal{X} \to \mathbb{R} \cup \{\infty\}$, where $\mathcal{X}$ is a Euclidean vector space equipped with an
inner product $\langle \cdot, \cdot \rangle$. We denote $\mathbb{R}^* = \mathbb{R} \cup \{\infty\}$. The subdifferential of $f$ at $x \in \mathcal{X}$ is denoted by $\partial f(x)$. The
Fenchel conjugate of $f$ is denoted by $f^*$. Given a norm $\|\cdot\|$, its dual norm is denoted by $\|\cdot\|_*$. We say that
a convex function is $V$-Lipschitz w.r.t. a norm $\|\cdot\|$ if for all $x \in \mathcal{X}$ there exists $v \in \partial f(x)$ with $\|v\|_* \le V$. Of
particular interest are $p$-norms, $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$.

When dealing with matrices, we consider the vector space $\mathcal{X} = \mathbb{R}^{m \times n}$ of real matrices of size $m \times n$
and the vector space $\mathcal{X} = \mathbb{S}^n$ of symmetric matrices of size $n \times n$, both equipped with the inner product
$\langle X, Y \rangle := \mathrm{Tr}(X^\top Y)$. Given a matrix $X$, the vector $\sigma(X)$ is the vector that contains the singular values of
$X$ in non-increasing order. For $X \in \mathbb{S}^n$, the vector $\lambda(X)$ is the vector that contains the eigenvalues of $X$
arranged in non-increasing order.

2.2 Strong Convexity-Strong Smoothness Duality 

Recall that the domain of $f : \mathcal{X} \to \mathbb{R}^*$ is $\{x : f(x) < \infty\}$ (allowing $f$ to take infinite values is the effective
way to restrict its domain to a proper subset of $\mathcal{X}$). We first define strong convexity.

Definition 1 A function $f : \mathcal{X} \to \mathbb{R}^*$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ if for all $x, y$ in the relative
interior of the domain of $f$ and $\alpha \in (0, 1)$ we have

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \tfrac{1}{2} \beta \alpha (1 - \alpha) \|x - y\|^2 .$$

We now define strong smoothness. Note that a strongly smooth function $f$ is always finite.

Definition 2 A function $f : \mathcal{X} \to \mathbb{R}$ is $\beta$-strongly smooth w.r.t. a norm $\|\cdot\|$ if $f$ is everywhere differentiable
and if for all $x, y$ we have

$$f(x + y) \le f(x) + \langle \nabla f(x), y \rangle + \tfrac{1}{2} \beta \|y\|^2 .$$

The following theorem states that strong convexity and strong smoothness are dual properties. Recall that
the biconjugate $f^{**}$ equals $f$ if and only if $f$ is closed and convex.

Theorem 3 (Strong/Smooth Duality) Assume that $f$ is a closed and convex function. Then $f$ is $\beta$-strongly
convex w.r.t. a norm $\|\cdot\|$ if and only if $f^*$ is $\frac{1}{\beta}$-strongly smooth w.r.t. the dual norm $\|\cdot\|_*$.

Subtly, note that while the domain of a strongly convex function $f$ may be a proper subset of $\mathcal{X}$ (important
for a number of settings), its conjugate $f^*$ always has a domain which is $\mathcal{X}$ (since if $f^*$ is strongly smooth
then it is finite and everywhere differentiable). The above theorem can be found, for instance, in Zalinescu
[2002] (see Corollary 3.5.11 on p. 217 and Remark 3.5.3 on p. 218). In the machine learning literature, a
proof of one direction (strong convexity $\Rightarrow$ strong smoothness) can be found in Shalev-Shwartz [2007]. We
could not find a proof of the reverse implication in a place easily accessible to machine learning people. So,
a self-contained proof is provided in the appendix.
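As an illustration of the duality, consider a standard example that is not taken from this section: the negative entropy restricted to the probability simplex is 1-strongly convex w.r.t. the $\ell_1$ norm, and its conjugate is the log-sum-exp function, which is finite everywhere and should therefore be 1-strongly smooth w.r.t. the $\ell_\infty$ norm. The sketch below checks the smoothness inequality of Definition 2 numerically on random points. The function names are illustrative, and a numerical check of this kind is a sanity test, not a proof.

```python
import math
import random

def logsumexp(theta):
    """Conjugate of the negative entropy restricted to the probability simplex."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))

def softmax(theta):
    """Gradient of logsumexp at theta."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [v / z for v in e]

def smoothness_gap(x, y, beta=1.0):
    """Right side minus left side of the strong smoothness inequality
    f*(x + y) <= f*(x) + <grad f*(x), y> + (beta / 2) * ||y||_inf^2."""
    lhs = logsumexp([a + b for a, b in zip(x, y)])
    inner = sum(g * b for g, b in zip(softmax(x), y))
    sup_norm = max(abs(b) for b in y)
    return logsumexp(x) + inner + 0.5 * beta * sup_norm ** 2 - lhs

rng = random.Random(0)
gaps = []
for _ in range(1000):
    d = rng.randint(2, 8)
    x = [rng.uniform(-5, 5) for _ in range(d)]
    y = [rng.uniform(-5, 5) for _ in range(d)]
    gaps.append(smoothness_gap(x, y))
min_gap = min(gaps)
print("smallest gap over all trials is non-negative:", min_gap >= -1e-9)
```

Note that the domain of the negative entropy is a proper subset of the space, while its conjugate is defined on all of it, which matches the remark above.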

The following direct corollary of Theorem 3 is central in proving both regret and generalization bounds.

Corollary 4 If $f$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$ and $f^*(0) = 0$, then, denoting the partial sum $\sum_{j \le i} v_j$ by
$v_{1:i}$, we have, for any sequence $v_1, \ldots, v_n$ and for any $u$,

$$\sum_{i=1}^n \langle v_i, u \rangle - f(u) \le f^*(v_{1:n}) \le \sum_{i=1}^n \langle \nabla f^*(v_{1:i-1}), v_i \rangle + \frac{1}{2\beta} \sum_{i=1}^n \|v_i\|_*^2 .$$

Proof: The 1st inequality is Fenchel-Young and the 2nd is from the definition of smoothness by induction. ■
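The two steps of this proof can be spelled out as follows. This is a sketch that uses only the Fenchel-Young inequality, the $\frac{1}{\beta}$-smoothness of $f^*$ given by Theorem 3, and the assumption $f^*(0) = 0$.

```latex
% Step 1 (Fenchel-Young), applied to the pair (u, v_{1:n}):
\sum_{i=1}^n \langle v_i, u \rangle - f(u)
  = \langle v_{1:n}, u \rangle - f(u) \le f^*(v_{1:n}).

% Step 2 (smoothness of f^* with parameter 1/\beta), for each i = 1, ..., n:
f^*(v_{1:i}) \le f^*(v_{1:i-1})
  + \langle \nabla f^*(v_{1:i-1}), v_i \rangle
  + \frac{1}{2\beta} \|v_i\|_*^2 .

% Summing over i telescopes the left side, and f^*(v_{1:0}) = f^*(0) = 0:
f^*(v_{1:n}) \le \sum_{i=1}^n \langle \nabla f^*(v_{1:i-1}), v_i \rangle
  + \frac{1}{2\beta} \sum_{i=1}^n \|v_i\|_*^2 .
```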

2.3 Machine learning implications of the strong-convexity / strong-smoothness duality 

We consider two learning models. 

• Online convex optimization: Let $\mathcal{W}$ be a convex set. Online convex optimization is a two player
repeated game. On round $t$ of the game, the learner (first player) should choose $w_t \in \mathcal{W}$ and the
environment (second player) responds with a convex function over $\mathcal{W}$, i.e. $l_t : \mathcal{W} \to \mathbb{R}$. The goal of the
learner is to minimize its regret defined as:

$$\frac{1}{n} \sum_{t=1}^n l_t(w_t) - \min_{w \in \mathcal{W}} \frac{1}{n} \sum_{t=1}^n l_t(w) .$$

• Batch learning of linear predictors: Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$. Our goal is to learn a
prediction rule from $\mathcal{X}$ to $\mathcal{Y}$. The prediction rule we use is based on a linear mapping $x \mapsto \langle w, x \rangle$, and
the quality of the prediction is assessed by a loss function $l(\langle w, x \rangle, y)$. Our primary goal is to find $w$
that has low risk (a.k.a. generalization error), defined as $L(w) = \mathbb{E}[l(\langle w, x \rangle, y)]$, where expectation is
with respect to $\mathcal{D}$. To do so, we can sample $n$ i.i.d. examples from $\mathcal{D}$ and observe the empirical risk,
$\hat{L}(w) = \frac{1}{n} \sum_{i=1}^n l(\langle w, x_i \rangle, y_i)$. The goal of the learner is to find $\hat{w}$ with a low excess risk defined as:

$$L(\hat{w}) - \min_{w \in \mathcal{W}} L(w) ,$$

where $\mathcal{W}$ is a set of vectors that forms the comparison class.

We now seamlessly provide learning guarantees for both models based on Corollary 4. We start with the 
online convex optimization model. 



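Algorithm 1 Online Mirror Descent

  $w_1 \leftarrow \nabla f^*(0)$
  for $t = 1$ to $T$ do
    Play $w_t \in \mathcal{W}$
    Receive $l_t$ and pick $v_t \in \partial l_t(w_t)$
    $w_{t+1} \leftarrow \nabla f^*\left(-\eta \sum_{s=1}^t v_s\right)$
  end for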
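The template above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the iterate is $w_t = \nabla f^*(-\eta \sum_{s<t} v_s)$ (the choice the regret analysis relies on), takes $f(w) = \frac{1}{2}\|w\|_2^2$ restricted to a Euclidean ball (so that $\nabla f^*$ is the projection onto that ball and $\beta = 1$), and uses absolute losses. All function and variable names are hypothetical.

```python
import math
import random

def grad_conjugate(theta, radius):
    """Gradient of f* for f(w) = 0.5 * ||w||_2^2 restricted to the ball
    {w : ||w||_2 <= radius}; this is the Euclidean projection of theta."""
    norm = math.sqrt(sum(t * t for t in theta))
    scale = 1.0 if norm <= radius else radius / norm
    return [scale * t for t in theta]

def online_mirror_descent(examples, radius, eta):
    """Run the template on absolute losses l_t(w) = |<w, x_t> - y_t|.
    Returns the list of iterates w_1, ..., w_T."""
    d = len(examples[0][0])
    theta = [0.0] * d                      # theta = -eta * (v_1 + ... + v_{t-1})
    iterates = []
    for x, y in examples:
        w = grad_conjugate(theta, radius)  # play w_t
        iterates.append(w)
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        sign = (residual > 0) - (residual < 0)
        v = [sign * xi for xi in x]        # v_t is a subgradient of l_t at w_t
        theta = [th - eta * vi for th, vi in zip(theta, v)]
    return iterates

def total_loss(ws, examples):
    return sum(abs(sum(wi * xi for wi, xi in zip(w, x)) - y)
               for w, (x, y) in zip(ws, examples))

rng = random.Random(1)
d, T, radius = 5, 400, 1.0
u = [radius / math.sqrt(d)] * d            # a comparator on the boundary of the ball
examples = []
for _ in range(T):
    x = [rng.uniform(-1, 1) for _ in range(d)]
    norm = math.sqrt(sum(xi * xi for xi in x))
    x = [xi / norm for xi in x]            # ||x_t||_2 = 1, so subgradients have norm at most V = 1
    examples.append((x, sum(ui * xi for ui, xi in zip(u, x))))

V, beta = 1.0, 1.0
eta = radius / (V * math.sqrt(T))          # balances the two terms of the bound below
iterates = online_mirror_descent(examples, radius, eta)
regret_vs_u = total_loss(iterates, examples) - total_loss([u] * T, examples)
bound = (0.5 * radius ** 2) / eta + eta * V ** 2 * T / (2 * beta)
print("regret against u is within the bound:", regret_vs_u <= bound)
```

Other choices of $f$ give other members of the family; only `grad_conjugate` would change.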



Regret Bound for Online Convex Optimization. Algorithm 1 provides one common algorithm which
achieves the following regret bound. It is one of a family of algorithms that enjoy the same regret bound 
(see Shalev-Shwartz [2007]). 

Theorem 5 (Regret) Suppose Algorithm 1 is used with a function $f$ that is $\beta$-strongly convex w.r.t. a norm
$\|\cdot\|$ on $\mathcal{W}$ and has $f^*(0) = 0$. Suppose the loss functions $l_t$ are convex and $V$-Lipschitz w.r.t. the dual norm
$\|\cdot\|_*$. Then, the algorithm run with any positive $\eta$ enjoys the regret bound,

$$\sum_{t=1}^T l_t(w_t) - \min_{u \in \mathcal{W}} \sum_{t=1}^T l_t(u) \le \frac{\max_{u \in \mathcal{W}} f(u)}{\eta} + \frac{\eta V^2 T}{2\beta} .$$



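Proof: Apply Corollary 4 to the sequence $-\eta v_1, \ldots, -\eta v_T$ to get, for all $u$,

$$-\eta \sum_{t=1}^T \langle v_t, u \rangle - f(u) \le -\eta \sum_{t=1}^T \langle v_t, w_t \rangle + \frac{\eta^2}{2\beta} \sum_{t=1}^T \|v_t\|_*^2 ,$$

where we used $w_t = \nabla f^*(-\eta v_{1:t-1})$. Using the fact that $l_t$ is $V$-Lipschitz, we get $\|v_t\|_* \le V$. Plugging this into the inequality above and rearranging
gives $\sum_{t=1}^T \langle v_t, w_t - u \rangle \le \frac{f(u)}{\eta} + \frac{\eta V^2 T}{2\beta}$. By convexity of $l_t$, $l_t(w_t) - l_t(u) \le \langle v_t, w_t - u \rangle$. Therefore,
$\sum_{t=1}^T l_t(w_t) - \sum_{t=1}^T l_t(u) \le \frac{f(u)}{\eta} + \frac{\eta V^2 T}{2\beta}$. Since the above holds for all $u \in \mathcal{W}$ the result follows. ■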
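The regret bound holds for any positive $\eta$, so it is natural to ask which value makes it smallest. The short derivation below is a sketch of that optimization; the shorthand $F$ and the symbol $\eta^\star$ are introduced here for illustration only, and the resulting step size assumes that $T$ is known in advance.

```latex
% Write F = \max_{u \in \mathcal{W}} f(u). The right-hand side of the bound is
g(\eta) = \frac{F}{\eta} + \frac{\eta V^2 T}{2\beta}.

% Setting g'(\eta) = 0:
-\frac{F}{\eta^2} + \frac{V^2 T}{2\beta} = 0
\quad\Longrightarrow\quad
\eta^\star = \frac{1}{V} \sqrt{\frac{2 \beta F}{T}} .

% At \eta^\star both terms equal V \sqrt{F T / (2\beta)}, hence
g(\eta^\star) = V \sqrt{\frac{2 F T}{\beta}},
\qquad
\frac{g(\eta^\star)}{T} = V \sqrt{\frac{2 F}{\beta T}} .
```

The average regret therefore decays at the rate $1/\sqrt{T}$, with the same dependence on $F$ and $\beta$ as the Rademacher bound derived next.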

Generalization bound for the batch model via Rademacher analysis. Let
$T = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$ be a training set obtained by sampling i.i.d. examples from $\mathcal{D}$.
For a class of real valued functions $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$, define its Rademacher complexity on $T$ to be

$$\mathcal{R}_T(\mathcal{F}) := \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i) \right] .$$

Here, the expectation is over the $\epsilon_i$'s, which are i.i.d. Rademacher random variables, i.e. $P(\epsilon_i = -1) = P(\epsilon_i = +1) = \frac{1}{2}$.
It is well known that bounds on Rademacher complexity of a class immediately yield generalization
bounds for classifiers picked from that class (assuming the loss function is Lipschitz). Recently, Kakade
et al. [2008] proved Rademacher complexity bounds for classes consisting of linear predictors using strong
convexity arguments. We now give a quick proof of their main result using Corollary 4. This proof is
essentially the same as their original proof but highlights the importance of Corollary 4.

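Theorem 6 (Generalization) Let $f$ be a $\beta$-strongly convex function w.r.t. a norm $\|\cdot\|$ and assume that
$f^*(0) = 0$. Let $\mathcal{X} = \{x : \|x\|_* \le X\}$ and $\mathcal{W} = \{w : f(w) \le f_{\max}\}$. Consider the class of linear
functions, $\mathcal{F} = \{x \mapsto \langle w, x \rangle : w \in \mathcal{W}\}$. Then, for any dataset $T \in \mathcal{X}^n$, we have

$$\mathcal{R}_T(\mathcal{F}) \le X \sqrt{\frac{2 f_{\max}}{\beta n}} .$$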

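Proof: Let $\lambda > 0$. Apply Corollary 4 with $u = w$ and $v_i = \lambda \epsilon_i x_i$ to get,

$$\sup_{w \in \mathcal{W}} \sum_{i=1}^n \langle w, \lambda \epsilon_i x_i \rangle \le \frac{\lambda^2}{2\beta} \sum_{i=1}^n \|\epsilon_i x_i\|_*^2 + \sup_{w \in \mathcal{W}} f(w) + \sum_{i=1}^n \langle \nabla f^*(v_{1:i-1}), \lambda \epsilon_i x_i \rangle$$
$$\le \frac{\lambda^2 X^2 n}{2\beta} + f_{\max} + \sum_{i=1}^n \langle \nabla f^*(v_{1:i-1}), \lambda \epsilon_i x_i \rangle .$$

Now take expectation on both sides. The left hand side is $n \lambda \mathcal{R}_T(\mathcal{F})$ and the last term on the right hand side
becomes zero, since $\epsilon_i$ has zero mean and is independent of $v_{1:i-1}$. Dividing throughout by $n\lambda$, we get
$\mathcal{R}_T(\mathcal{F}) \le \frac{\lambda X^2}{2\beta} + \frac{f_{\max}}{n\lambda}$. Optimizing over $\lambda$ gives us the
result.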
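For a small dataset the Rademacher complexity of a linear class can be computed exactly by enumerating all sign vectors, which gives a direct check of the bound $X \sqrt{2 f_{\max} / (\beta n)}$. The sketch below does this under the assumption $f(w) = \frac{1}{2}\|w\|_2^2$ (so $\beta = 1$, the dual norm is again $\ell_2$, and the class is a Euclidean ball). The data points and all names are made up for illustration.

```python
import itertools
import math

def rademacher_l2_ball(xs, radius):
    """Exact empirical Rademacher complexity of {x -> <w, x> : ||w||_2 <= radius}.
    For a fixed sign vector the supremum over w equals
    (radius / n) * ||sum_i eps_i x_i||_2, so we average that over all 2^n sign vectors."""
    n, d = len(xs), len(xs[0])
    total = 0.0
    for eps in itertools.product((-1.0, 1.0), repeat=n):
        s = [sum(e * x[j] for e, x in zip(eps, xs)) for j in range(d)]
        total += radius / n * math.sqrt(sum(v * v for v in s))
    return total / 2 ** n

# Eight illustrative points on the unit circle, so ||x_i||_2 <= X = 1.
xs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-0.8, 0.6],
      [1.0, 0.0], [0.0, -1.0], [0.6, -0.8], [0.8, 0.6]]
radius = 2.0
f_max = 0.5 * radius ** 2      # W = {w : 0.5 * ||w||_2^2 <= f_max} is the ball of this radius
beta, X = 1.0, 1.0
n = len(xs)
exact = rademacher_l2_ball(xs, radius)
bound = X * math.sqrt(2 * f_max / (beta * n))
print("exact complexity:", exact, "bound:", bound)
```

Enumeration costs $2^n$ evaluations, so this is only practical for very small $n$; for larger samples one would average over randomly drawn sign vectors instead.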

Combining the above with the contraction lemma and standard Rademacher based generalization bounds 
(see e.g. Bartlett and Mendelson [2002], Kakade et al. [2008]) we obtain: 

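Corollary 7 Let $f$ be a $\beta$-strongly convex function w.r.t. a norm $\|\cdot\|$ and assume that $f^*(0) = 0$. Let
$\mathcal{X} = \{x : \|x\|_* \le X\}$ and $\mathcal{W} = \{w : f(w) \le f_{\max}\}$. Let $l$ be a $\rho$-Lipschitz scalar loss function and let
$\mathcal{D}$ be an arbitrary distribution over $\mathcal{X} \times \mathcal{Y}$. Then, the algorithm that receives $n$ i.i.d. examples and returns $\hat{w}$
that minimizes the empirical risk, $\hat{L}(w)$, satisfies

$$\mathbb{E}\left[ L(\hat{w}) - \min_{w \in \mathcal{W}} L(w) \right] \le O\left( \rho X \sqrt{\frac{f_{\max}}{\beta n}} \right) ,$$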



where expectation is with respect to the choice of the n i.i.d. examples. 

We note that it is also easy to obtain a generalization bound that holds with high probability, but for simplicity 
of the presentation we stick to expectations. 

2.4 Strongly Convex Matrix Functions 

Before we consider strongly convex matrix functions, let us recall the following result about strong convexity
of the vector $\ell_p$ norm. Its proof can be found e.g. in Shalev-Shwartz [2007].

Lemma 8 Let $q \in [1, 2]$. The function $f : \mathbb{R}^d \to \mathbb{R}$ defined as $f(w) = \frac{1}{2} \|w\|_q^2$ is $(q - 1)$-strongly convex
with respect to $\|\cdot\|_q$ over $\mathbb{R}^d$.

We mainly use the above lemma to obtain results with respect to the norms $\|\cdot\|_2$ and $\|\cdot\|_1$. The case
$q = 2$ is straightforward. Obtaining results with respect to $\|\cdot\|_1$ is slightly more tricky since for $q = 1$
the strong convexity parameter is 0 (meaning that the function is not strongly convex). To overcome this
problem, we shall set $q$ to be slightly more than 1, e.g. $q = \frac{\ln(d)}{\ln(d) - 1}$. For this choice of $q$, the strong convexity
parameter becomes $q - 1 = 1/(\ln(d) - 1) \ge 1/\ln(d)$ and the value of $p$ corresponding to the dual norm is
$p = (1 - 1/q)^{-1} = \ln(d)$. Note that for any $x \in \mathbb{R}^d$ we have

$$\|x\|_\infty \le \|x\|_p \le (d \|x\|_\infty^p)^{1/p} = d^{1/p} \|x\|_\infty = e \|x\|_\infty \le 3 \|x\|_\infty .$$

Hence the dual norms are also equivalent up to a factor of 3: $\|w\|_1 \ge \|w\|_q \ge \|w\|_1 / 3$. The above lemma
therefore implies the following corollary.

Corollary 9 The function $f : \mathbb{R}^d \to \mathbb{R}$ defined as $f(w) = \frac{1}{2} \|w\|_q^2$ for $q = \frac{\ln(d)}{\ln(d) - 1}$ is $1/(3 \ln(d))$-strongly
convex with respect to $\|\cdot\|_1$ over $\mathbb{R}^d$.
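The choice $q = \ln(d)/(\ln(d) - 1)$ and the resulting norm equivalences are easy to check numerically. The sketch below does so for one illustrative vector; the helper `p_norm` and the test vector are assumptions made for the example. One caveat that follows from the formula itself: $q \le 2$ requires $\ln(d) \ge 2$, i.e. $d \ge 8$, so the example uses a dimension well above that.

```python
import math

def p_norm(x, p):
    """l_p norm of a list of floats, for finite p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

d = 1000
q = math.log(d) / (math.log(d) - 1.0)   # slightly more than 1
p = 1.0 / (1.0 - 1.0 / q)               # dual exponent; equals ln(d)

# An illustrative dense vector with entries of mixed sign.
x = [(-1) ** i * (i + 1) / d for i in range(d)]
sup_norm = max(abs(v) for v in x)
ratio = p_norm(x, p) / sup_norm         # lies between 1 and d**(1/p) = e

l1 = p_norm(x, 1.0)
lq = p_norm(x, q)                       # lies between l1 / 3 and l1

print("q =", q, "p =", p, "ln(d) =", math.log(d))
print("||x||_p / ||x||_inf =", ratio)
print("||x||_1 / ||x||_q =", l1 / lq)
```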

We now consider two families of strongly convex matrix functions. 

Schatten q-norms. The first result we need is the counterpart of Lemma 8 for the q-Schatten norm defined
as $\|X\|_{S(q)} := \|\sigma(X)\|_q$. This result can be found in Ball et al. [1994].

Theorem 10 (Schatten matrix functions) Let $q \in [1, 2]$. The function $F : \mathbb{R}^{m \times n} \to \mathbb{R}$ defined as $F(X) =
\frac{1}{2} \|\sigma(X)\|_q^2$ is $(q - 1)$-strongly convex w.r.t. the q-Schatten norm $\|X\|_{S(q)} := \|\sigma(X)\|_q$ over $\mathbb{R}^{m \times n}$.

As above, choosing $q$ to be $\frac{\ln(m')}{\ln(m') - 1}$ for $m' = \min\{m, n\}$ gives the following corollary.

Corollary 11 The function $F : \mathbb{R}^{m \times n} \to \mathbb{R}$ defined as $F(W) = \frac{1}{2} \|W\|_{S(q)}^2$ for $q = \frac{\ln(m')}{\ln(m') - 1}$ is
$1/(3 \ln(m'))$-strongly convex with respect to $\|\cdot\|_{S(1)}$ over $\mathbb{R}^{m \times n}$.



Group Norms. Let $X = (X^1 X^2 \ldots X^n)$ be an $m \times n$ real matrix with columns $X^i \in \mathbb{R}^m$. We define the norm
$\|X\|_{r,p}$ as

$$\|X\|_{r,p} := \left\| \left( \|X^1\|_r, \ldots, \|X^n\|_r \right) \right\|_p .$$

That is, we apply $\|\cdot\|_r$ to each column of $X$ to get a vector in $\mathbb{R}^n$ to which we apply the norm $\|\cdot\|_p$ to
get the value of $\|X\|_{r,p}$. It is easy to check that this is indeed a norm. The dual of $\|\cdot\|_{r,p}$ is $\|\cdot\|_{s,t}$ where
$1/r + 1/s = 1$ and $1/p + 1/t = 1$. The following theorem, which appears in a slightly weaker form in Juditsky
and Nemirovski [2008], provides us with an easy way to construct strongly convex group norms. We provide 
a proof in the appendix which is much simpler than that of Juditsky and Nemirovski [2008] and is completely 
"calculus free". 



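Theorem 12 (Group Norms) Let $\Psi$, $\Phi$ be absolutely symmetric norms on $\mathbb{R}^m$, $\mathbb{R}^n$. Let $\Phi^2 \circ \sqrt{\cdot} : \mathbb{R}^n \to \mathbb{R}^*$
denote the following function,

$$(\Phi^2 \circ \sqrt{\cdot})(x) := \Phi^2(\sqrt{x_1}, \ldots, \sqrt{x_n}) . \qquad (1)$$

Suppose $(\Phi^2 \circ \sqrt{\cdot})$ is a norm on $\mathbb{R}^n$. Further, let the functions $\Psi^2$ and $\Phi^2$ be $\sigma_1$- and $\sigma_2$-smooth w.r.t. $\Psi$ and
$\Phi$ respectively. Then, $\|\cdot\|_{\Psi,\Phi}^2$ is $(\sigma_1 + \sigma_2)$-smooth w.r.t. $\|\cdot\|_{\Psi,\Phi}$.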

The condition that Eq. (1) be a norm appears strange but in fact it already occurs in the literature. Norms
satisfying it are called quadratic symmetric gauge functions (or Q-norms) [Bhatia, 1997, p. 89]. It is easy
to see that $\|\cdot\|_p$ for $p \ge 2$ is a Q-norm. Now using strong convexity/strong smoothness duality and the
discussion preceding Corollary 9, we get the following corollary.

Corollary 13 The function $F : \mathbb{R}^{m \times n} \to \mathbb{R}$ defined as $F(W) = \frac{1}{2} \|W\|_{2,q}^2$ for $q = \frac{\ln(n)}{\ln(n) - 1}$ is $1/(3 \ln(n))$-
strongly convex with respect to $\|\cdot\|_{2,1}$ over $\mathbb{R}^{m \times n}$.
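The group norm $\|\cdot\|_{r,p}$ used in the statements above can be computed directly from its definition: a column-wise $\ell_r$ norm followed by an $\ell_p$ norm of the resulting vector. The sketch below implements it for matrices stored as lists of rows and checks the Hölder-type inequality $\langle X, Y \rangle \le \|X\|_{r,p} \|Y\|_{s,t}$ that underlies the duality between $\|\cdot\|_{r,p}$ and $\|\cdot\|_{s,t}$. The function names and the two small matrices are illustrative assumptions.

```python
import math

def vector_norm(v, p):
    """l_p norm of a sequence of floats; p may be math.inf."""
    if p == math.inf:
        return max(abs(a) for a in v)
    return sum(abs(a) ** p for a in v) ** (1.0 / p)

def group_norm(X, r, p):
    """||X||_{r,p}: l_r norm of each column, then l_p norm of those values.
    X is a list of rows (m rows, n columns)."""
    columns = list(zip(*X))
    return vector_norm([vector_norm(col, r) for col in columns], p)

def inner(X, Y):
    """Tr(X^T Y), i.e. the sum of entrywise products."""
    return sum(a * b for rx, ry in zip(X, Y) for a, b in zip(rx, ry))

X = [[3.0, 0.0, 1.0],
     [4.0, 0.0, 0.0]]
Y = [[1.0, 2.0, -1.0],
     [0.5, 0.0, 2.0]]

norm_21 = group_norm(X, 2, 1)             # column norms are 5, 0, 1
norm_2inf = group_norm(X, 2, math.inf)
norm_11 = group_norm(X, 1, 1)
holder_ok = inner(X, Y) <= norm_21 * group_norm(Y, 2, math.inf)
print(norm_21, norm_2inf, norm_11, holder_ok)
```

Here $\|X\|_{2,1} = 5 + 0 + 1 = 6$: the zero column contributes nothing, which is why the $(2,1)$ norm favours matrices with few non-zero columns.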

2.5 Putting it all together 

Combining Lemma 8 and Corollary 9 with the bounds given in Theorem 5 and Corollary 7 we therefore
obtain the following two corollaries. 

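Corollary 14 Let $\mathcal{W} = \{w : \|w\|_1 \le W\}$ and let $l_1, \ldots, l_n$ be a sequence of functions which are $X$-
Lipschitz w.r.t. $\|\cdot\|_\infty$. Then, there exists an online algorithm with a regret bound of the form

$$\frac{1}{n} \sum_{t=1}^n l_t(w_t) - \min_{w \in \mathcal{W}} \frac{1}{n} \sum_{t=1}^n l_t(w) \le O\left( X W \sqrt{\frac{\ln(d)}{n}} \right) .$$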



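Corollary 15 Let $\mathcal{W} = \{w : \|w\|_1 \le W\}$ and let $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_\infty \le X\}$. Let $l$ be a $\rho$-Lipschitz
scalar loss function and let $\mathcal{D}$ be an arbitrary distribution over $\mathcal{X} \times \mathcal{Y}$. Then, there exists a batch learning
algorithm that returns a vector $\hat{w}$ such that

$$\mathbb{E}\left[ L(\hat{w}) - \min_{w \in \mathcal{W}} L(w) \right] \le O\left( \rho X W \sqrt{\frac{\ln(d)}{n}} \right) .$$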

Results of the same flavor can be obtained for learning matrices. For simplicity, we present the following 
two corollaries only for the online model, but it is easy to derive their batch counterparts. 

Corollary 16 Let $\mathcal{W} = \{W \in \mathbb{R}^{k \times d} : \|W\|_{2,1} \le W\}$ and let $l_1, \ldots, l_n$ be a sequence of functions which
are $X$-Lipschitz w.r.t. $\|\cdot\|_{2,\infty}$. Then, there exists an online algorithm with a regret bound of the form

$$\frac{1}{n} \sum_{t=1}^n l_t(W_t) - \min_{W \in \mathcal{W}} \frac{1}{n} \sum_{t=1}^n l_t(W) \le O\left( X W \sqrt{\frac{\ln(d)}{n}} \right) .$$



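Corollary 17 Let $\mathcal{W} = \{W \in \mathbb{R}^{k \times d} : \|W\|_{S(1)} \le W\}$ and let $l_1, \ldots, l_n$ be a sequence of functions which
are $X$-Lipschitz w.r.t. $\|\cdot\|_{S(\infty)}$. Then, there exists an online algorithm with a regret bound of the form

$$\frac{1}{n} \sum_{t=1}^n l_t(W_t) - \min_{W \in \mathcal{W}} \frac{1}{n} \sum_{t=1}^n l_t(W) \le O\left( X W \sqrt{\frac{\ln(\min\{k, d\})}{n}} \right) .$$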



3 Matrix Regularization 



We are now ready to demonstrate the power of the general techniques we derived in the previous section.
Consider a learning problem (either online or batch) in which $\mathcal{X}$ is a subset of a matrix space (of dimension
$k \times d$) and we would like to learn a linear predictor of the form $X \mapsto \langle W, X \rangle$ where $W$ is also a matrix of
the same dimension. The loss function takes the form $l(\langle W, X \rangle, y)$ and we assume for simplicity that $l$ is
1-Lipschitz with respect to its first argument. For example, $l$ can be the absolute loss, $l(a, y) = |a - y|$, or
the hinge-loss, $l(a, y) = \max\{0, 1 - ya\}$.

For the sake of concreteness, let us focus on the batch learning setting, but we note that the discussion
below is relevant to the online learning model as well. Our prior knowledge on the learning problem is
encoded by the definition of the comparison class $\mathcal{W}$ that we use. In particular, all the comparison classes
we use take the form $\mathcal{W} = \{W : \|W\| \le W\}$, where the only difference is which norm we use. We shall
compare the following four classes:

$$\mathcal{W}_{1,1} = \{W : \|W\|_{1,1} \le W_{1,1}\} \qquad \mathcal{W}_{2,2} = \{W : \|W\|_{2,2} \le W_{2,2}\}$$
$$\mathcal{W}_{2,1} = \{W : \|W\|_{2,1} \le W_{2,1}\} \qquad \mathcal{W}_{S(1)} = \{W : \|W\|_{S(1)} \le W_{S(1)}\}$$

Let us denote $X_{\infty,\infty} = \sup_{X \in \mathcal{X}} \|X\|_{\infty,\infty}$. We define $X_{2,2}$, $X_{2,\infty}$, $X_{S(\infty)}$ analogously. Applying the
results of the previous section to these classes we obtain the bounds given in Table 1 where for simplicity we
ignore constants.



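class                   bound
$\mathcal{W}_{1,1}$     $W_{1,1} \, X_{\infty,\infty} \sqrt{\ln(kd)/n}$
$\mathcal{W}_{2,2}$     $W_{2,2} \, X_{2,2} \sqrt{1/n}$
$\mathcal{W}_{2,1}$     $W_{2,1} \, X_{2,\infty} \sqrt{\ln(d)/n}$
$\mathcal{W}_{S(1)}$    $W_{S(1)} \, X_{S(\infty)} \sqrt{\ln(\min\{d,k\})/n}$

Table 1: List of bounds for learning with matrices. For simplicity we ignore constants.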

Let us now discuss which class should be used based on prior knowledge on properties of the learning
problem. We start with the well known difference between $\mathcal{W}_{1,1}$ and $\mathcal{W}_{2,2}$. Note that both of these classes
ignore the fact that $W$ is organized as a $k \times d$ matrix and simply refer to $W$ as a single vector of dimension
$kd$. The difference between $\mathcal{W}_{1,1}$ and $\mathcal{W}_{2,2}$ is therefore the usual difference between $\ell_1$ and $\ell_2$ regularization.
To understand this difference, suppose that $W$ is some matrix that performs well on the distribution we have.
Then, we should take the radius of each class to be the minimal possible while still containing $W$, namely,
either $\|W\|_{1,1}$ or $\|W\|_{2,2}$. Clearly, $\|W\|_{2,2} \le \|W\|_{1,1}$ and therefore in terms of this term there is a clear
advantage to use the class $\mathcal{W}_{2,2}$. On the other hand, $X_{2,2} \ge X_{\infty,\infty}$. We therefore need to understand which
of these inequalities is more important. Of course, in general, the answer to this question is data dependent.
However, we can isolate properties of the distribution that can help us choose the better class.

One useful property is sparsity of either $X$ or $W$. If $X$ is assumed to be $s$-sparse (i.e., it has at most
$s$ non-zero elements), then we have $X_{2,2} \le \sqrt{s} \, X_{\infty,\infty}$. That is, for a small $s$, the difference between $X_{2,2}$
and $X_{\infty,\infty}$ is small. In contrast, if $X$ is very dense and each of its entries is bounded away from zero, e.g.
$X \in \{\pm 1\}^{k \times d}$, then $\|X\|_{2,2} = \sqrt{kd} \, \|X\|_{\infty,\infty}$. The same arguments are true for $W$. Hence, with prior
knowledge about the sparsity of $X$ and $W$ we can guess which of the bounds will be smaller.

Next, we tackle the more interesting cases of $\mathcal{W}_{2,1}$ and $\mathcal{W}_{S(1)}$. For the former, recall that we first apply
the $\ell_2$ norm on each column of $W$ and then apply the $\ell_1$ norm on the obtained vector of norm values. Similarly, to
calculate $\|X\|_{2,\infty}$ we first apply the $\ell_2$ norm on columns of $X$ and then apply the $\ell_\infty$ norm on the obtained vector of
norm values. Let us now compare $\mathcal{W}_{2,1}$ to $\mathcal{W}_{1,1}$. Suppose that the columns of $X$ are very sparse. Then
the $\ell_2$ norm of each column of $X$ is very close to its $\ell_\infty$ norm. On the other hand, if some of the columns of
$W$ are dense, then $\|W\|_{2,1}$ can be order of $\sqrt{k}$ smaller than $\|W\|_{1,1}$. In that case, the class $\mathcal{W}_{2,1}$ is preferable
over the class $\mathcal{W}_{1,1}$. As we show later, this is the case in multi-class problems, and we shall indeed present an
improved multi-class algorithm that uses the class $\mathcal{W}_{2,1}$. Of course, in some problems, columns of $X$ might
be very dense while columns of $W$ can be sparse. In such cases, using $\mathcal{W}_{1,1}$ is better than using $\mathcal{W}_{2,1}$.

Now let us compare $\mathcal{W}_{2,1}$ to $\mathcal{W}_{2,2}$. Similarly to the previous discussion, choosing $\mathcal{W}_{2,1}$ over $\mathcal{W}_{2,2}$ makes
sense if we assume that the vector of $\ell_2$ norms of columns, $(\|W^1\|_2, \ldots, \|W^d\|_2)$, is sparse. This implies
that we assume a "group"-sparsity pattern of $W$, i.e., each column of $W$ is either the all zeros column or is
dense. This type of grouped-sparsity has been studied in the context of group Lasso and multi-task learning.
Indeed, we present bounds for multi-task learning that rely on this assumption. Without the group-sparsity
assumption, it might be better to use $\mathcal{W}_{2,2}$ over $\mathcal{W}_{2,1}$.
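The comparison between the first three classes can be made concrete on a toy instance. The sketch below builds a group-sparse $W$ (a single dense column) and an $X$ whose columns each have one non-zero entry, then computes the norm products that lead the corresponding bounds of Table 1. The instance, the dictionary keys and the helper names are illustrative assumptions, and the logarithmic factors and the common $1/\sqrt{n}$ factor are deliberately left out.

```python
import math

def vector_norm(v, p):
    """l_p norm of a sequence of floats; p may be math.inf."""
    if p == math.inf:
        return max(abs(a) for a in v)
    return sum(abs(a) ** p for a in v) ** (1.0 / p)

def group_norm(M, r, p):
    """||M||_{r,p}: l_r norm of each column, then l_p norm of those values."""
    return vector_norm([vector_norm(col, r) for col in zip(*M)], p)

k, d = 16, 50
# W is group-sparse: one dense column of ones, all other columns zero.
W = [[1.0 if j == 0 else 0.0 for j in range(d)] for _ in range(k)]
# Every column of X has exactly one non-zero entry.
X = [[1.0 if i == j % k else 0.0 for j in range(d)] for i in range(k)]

products = {
    "1,1": group_norm(W, 1, 1) * group_norm(X, math.inf, math.inf),
    "2,2": group_norm(W, 2, 2) * group_norm(X, 2, 2),
    "2,1": group_norm(W, 2, 1) * group_norm(X, 2, math.inf),
}
for name, value in sorted(products.items(), key=lambda kv: kv[1]):
    print(name, value)
```

On this instance the $(2,1)$ product equals $\sqrt{k}$, the $(1,1)$ product equals $k$, and the $(2,2)$ product equals $\sqrt{kd}$, in line with the discussion above: dense columns of $W$ combined with sparse columns of $X$ favour $\mathcal{W}_{2,1}$.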

Finally, we discuss when it makes sense to use $\mathcal{W}_{S(1)}$. Recall that $\|W\|_{S(1)} = \|\sigma(W)\|_1$, where $\sigma(W)$ is the vector of singular values of $W$, and $\|X\|_{S(\infty)} = \|\sigma(X)\|_\infty$. Therefore, the class $\mathcal{W}_{S(1)}$ should be used when we assume that the spectrum of $W$ is sparse while the spectrum of $X$ is dense. This means that the prior knowledge we employ is that $W$ is of low rank while $X$ is of high rank. Note that $\mathcal{W}_{2,2}$ can be defined equivalently as $\mathcal{W}_{S(2)}$. Therefore, the difference between $\mathcal{W}_{S(1)}$ and $\mathcal{W}_{2,2}$ is similar to the difference between $\mathcal{W}_{1,1}$ and $\mathcal{W}_{2,2}$, just that instead of considering sparsity properties of the elements of $W$ and $X$ we consider sparsity properties of the spectrum of $W$ and $X$.
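To make the spectral analogy concrete, the Schatten norms can be computed from the singular values. This is a minimal sketch; the helper `schatten_norm`, the sizes, and the rank-one test matrix are assumptions chosen for illustration.

```python
import numpy as np

def schatten_norm(W, p):
    """||W||_{S(p)}: the l_p norm of the vector of singular values of W."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return np.linalg.norm(sigma, ord=p)

rng = np.random.default_rng(0)
k, d = 8, 30  # illustrative sizes (assumed)

# Rank-one W: every row is a multiple of the same direction,
# so the spectrum has a single non-zero entry.
W_low_rank = np.outer(rng.standard_normal(k), rng.standard_normal(d))

trace_norm = schatten_norm(W_low_rank, 1)
frobenius = schatten_norm(W_low_rank, 2)      # same as ||W||_{2,2}
spectral = schatten_norm(W_low_rank, np.inf)
```

When the spectrum is maximally sparse (rank one) the three norms coincide, mirroring the way $\|w\|_1 = \|w\|_2$ for a vector with a single non-zero entry.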

In the next sections we demonstrate how to apply the general methodology described above in order to 
derive a few generalization and regret bounds for problems of recent interest. 

4 Multi-task learning 

Suppose we are simultaneously solving $k$ multivariate prediction problems, where each learning example is of the form $(X, y)$, where $X \in \mathbb{R}^{k \times d}$ is a matrix of example vectors with examples from different tasks sitting in rows of $X$, and $y \in \mathbb{R}^k$ are the responses for the $k$ problems. To predict the $k$ responses, we learn a matrix $W \in \mathbb{R}^{k \times d}$ such that $\mathrm{Diag}(W X^\top)$ is a good predictor of $y$. In this section, we denote row $j$ of $W$ by $w^j$ and row $j$ of $X$ by $x^j$. The predictor for the $j$th task is therefore $w^j$. The quality of a prediction $\langle w^j, x^j \rangle$ for the $j$th task is assessed by a loss function $\ell^j : \mathbb{R} \times \mathcal{Y}^j \to \mathbb{R}$, and the total loss of $W$ on an example $(X, y)$ is defined to be the sum of the individual losses,

$$\ell(W, X, y) = \sum_{j=1}^{k} \ell^j\big(\langle w^j, x^j \rangle, y^j\big).$$

This formulation allows us to mix regression and classification problems and even use different loss functions 
for different tasks. Such "heterogeneous" multi-task learning has attracted recent attention [Yang et al., 2009] . 
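The total loss is simple to evaluate once the per-task predictions $\langle w^j, x^j \rangle$ are available. The sketch below mixes a regression task with a classification task; the two loss functions, the helper names, and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def squared_loss(prediction, target):
    return 0.5 * (prediction - target) ** 2

def hinge_loss(prediction, target):
    # target is assumed to be +1 or -1
    return max(0.0, 1.0 - target * prediction)

def multitask_loss(W, X, y, losses):
    """Sum over tasks j of losses[j](<w^j, x^j>, y^j); rows index tasks."""
    predictions = np.einsum("jd,jd->j", W, X)  # the diagonal of W X^T
    return sum(loss(pred, target)
               for loss, pred, target in zip(losses, predictions, y))

# Toy example with k = 2 tasks and d = 2 features (assumed values).
W = np.array([[1.0, 0.0], [0.0, 2.0]])
X = np.array([[2.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0])
total = multitask_loss(W, X, y, [squared_loss, hinge_loss])
```

Here task 1 is scored with the squared loss and task 2 with the hinge loss, which is the kind of heterogeneous combination the formulation permits.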

If the tasks are related, then it is natural to use regularizers that "couple" the tasks together so that similarities across tasks can be exploited. Considerations of common sparsity patterns (same features relevant across different tasks) lead to the use of group norm regularizers (i.e. using the comparison class $\mathcal{W}_{2,1}$ defined in the previous section) while rank considerations (the $w^j$'s lie in a low dimensional linear space) lead to the use of unitarily invariant norms as regularizers (i.e. the comparison class is $\mathcal{W}_{S(1)}$).

We now describe online and batch multi-task learning using different matrix norms.

4.1 Online multi-task learning 

In the online model, on round $t$ the learner first uses $W_t$ to predict the vector of responses and then it pays the cost $\ell_t(W_t) = \ell(W_t, X_t, y_t) = \sum_{j=1}^{k} \ell^j\big(\langle w_t^j, x_t^j \rangle, y_t^j\big)$. Let $V_t \in \mathbb{R}^{k \times d}$ be a sub-gradient of $\ell_t$ at $W_t$. It is easy to verify that the $j$th row of $V_t$, denoted $v_t^j$, is a sub-gradient of $\ell^j\big(\langle w_t^j, x_t^j \rangle, y_t^j\big)$ at $w_t^j$. Assuming that $\ell^j$ is $\rho$-Lipschitz with respect to its first argument, we obtain that $v_t^j = \tau_t^j x_t^j$ for some $\tau_t^j \in [-\rho, \rho]$. In other words, $V_t = \mathrm{Diag}(\tau_t)\, X_t$. It is easy to verify that $\|V_t\|_{r,p} \le \rho\, \|X_t\|_{r,p}$ for any $r, p \ge 1$. In addition, since any Schatten norm is sub-multiplicative we also have that $\|V_t\|_{S(\infty)} \le \|\mathrm{Diag}(\tau_t)\|_{S(\infty)}\, \|X_t\|_{S(\infty)} \le \rho\, \|X_t\|_{S(\infty)}$. We therefore obtain the following:

Corollary 18 Let $\mathcal{W}_{1,1}, \mathcal{W}_{2,2}, \mathcal{W}_{2,1}, \mathcal{W}_{S(1)}$ be the classes defined in Section 3 and let $X_{\infty,\infty}, X_{2,2}, X_{2,\infty}, X_{S(\infty)}$ be the radii of $\mathcal{X}$ w.r.t. the corresponding norms. Then, there exist online multi-task learning algorithms with regret bounds according to Table 1.
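The sub-gradient structure $V_t = \mathrm{Diag}(\tau_t) X_t$ behind this corollary can be checked numerically. The sketch below uses the absolute loss $|\langle w^j, x^j\rangle - y^j|$ for every task, which is 1-Lipschitz; this choice of loss, the helper names, and the random data are assumptions made for the example.

```python
import numpy as np

def absolute_multitask_loss(W, X, y):
    """Sum over tasks j of |<w^j, x^j> - y^j| (each term is 1-Lipschitz)."""
    return np.abs(np.einsum("jd,jd->j", W, X) - y).sum()

def subgradient(W, X, y):
    """Return V = Diag(tau) X, where tau^j is a sub-gradient of |. - y^j|."""
    tau = np.sign(np.einsum("jd,jd->j", W, X) - y)  # each tau^j lies in [-1, 1]
    return tau[:, None] * X                         # row j of X scaled by tau^j

rng = np.random.default_rng(0)
k, d = 4, 7  # illustrative sizes (assumed)
W, X = rng.standard_normal((k, d)), rng.standard_normal((k, d))
y = rng.standard_normal(k)
V = subgradient(W, X, y)

# Norm comparison with rho = 1: ||V||_{S(inf)} <= ||X||_{S(inf)}.
spectral_V = np.linalg.norm(V, ord=2)
spectral_X = np.linalg.norm(X, ord=2)

# Sub-gradient inequality at a second, arbitrary point W_other.
W_other = rng.standard_normal((k, d))
slack = (absolute_multitask_loss(W_other, X, y)
         - absolute_multitask_loss(W, X, y)
         - np.sum(V * (W_other - W)))
```

The quantity `slack` is non-negative by convexity, and the spectral norm of `V` does not exceed that of `X`, as the argument above predicts for $\rho = 1$.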

Let us now discuss a few implications of these bounds, and for simplicity assume that $k < d$. Recall that each column of $X$ represents the value of a single feature for all the tasks. As discussed in the previous section, if the matrix $X$ is dense and if we assume that $W$ is sparse, then using the class $\mathcal{W}_{1,1}$ is better than using $\mathcal{W}_{2,2}$. Such a scenario often happens when we have many irrelevant features and only a few features that can predict the target reasonably well. Concretely, suppose that $X \in \{0, 1\}^{k \times d}$ and that it typically has $s_x$ non-zero values. Suppose also that there exists a matrix $W$ that predicts the targets of the different tasks reasonably well and has $s_w$ non-zero values. Then, the bound for $\mathcal{W}_{1,1}$ is order of $s_w \sqrt{\ln(dk)/n}$ while the bound for $\mathcal{W}_{2,2}$ is order of $\sqrt{s_w s_x / n}$. Thus, $\mathcal{W}_{1,1}$ will be better if $s_w < s_x / \ln(dk)$.

Now, consider the class $\mathcal{W}_{2,1}$. Let us further assume the following. The non-zero elements of $W$ are grouped into $s_g$ columns and are roughly distributed evenly over those columns; the non-zeros of $X$ are roughly distributed evenly over the columns. Then, the bound for $\mathcal{W}_{2,1}$ is $s_g \sqrt{(s_w/s_g)\,(s_x/d)\,\ln(d)/n} = \sqrt{s_g\, s_w\, (s_x/d)\, \ln(d)/n}$. This bound will be better than the bound of $\mathcal{W}_{2,2}$ if $s_g \ln(d) < d$ and will be better than the bound of $\mathcal{W}_{1,1}$ if $s_g s_x / d < s_w$. We see that there are scenarios in which the group norm is better than the non-grouped norms and that the most adequate class depends on properties of the problem and our prior beliefs on a good predictor $W$.
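To get a feel for these orders of magnitude, the three expressions can be evaluated for a hypothetical problem. The numbers below ($k$, $d$, $n$, $s_x$, $s_w$, $s_g$) are assumptions picked so that the group-sparsity conditions hold; constants hidden in the "order of" statements are ignored.

```python
import math

def bound_w11(s_w, d, k, n):
    """Order of the W_{1,1} bound: s_w * sqrt(ln(dk) / n)."""
    return s_w * math.sqrt(math.log(d * k) / n)

def bound_w22(s_w, s_x, n):
    """Order of the W_{2,2} bound: sqrt(s_w * s_x / n)."""
    return math.sqrt(s_w * s_x / n)

def bound_w21(s_g, s_w, s_x, d, n):
    """Order of the W_{2,1} bound: sqrt(s_g * s_w * (s_x / d) * ln(d) / n)."""
    return math.sqrt(s_g * s_w * (s_x / d) * math.log(d) / n)

# Hypothetical problem: 10 tasks, 1000 features, half of X non-zero,
# and a predictor W supported on 5 fully dense columns.
k, d, n = 10, 1000, 10_000
s_x, s_g = 5000, 5
s_w = k * s_g

b11 = bound_w11(s_w, d, k, n)
b22 = bound_w22(s_w, s_x, n)
b21 = bound_w21(s_g, s_w, s_x, d, n)
```

With these values both conditions $s_g \ln(d) < d$ and $s_g s_x / d < s_w$ hold, and the group-norm expression is the smallest of the three.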



As to the bound for $\mathcal{W}_{S(1)}$, it is easy to verify that if the rows of $W$ sit in a low dimensional subspace then the spectrum of $W$ will be sparse. Similarly, the value of $\|X\|_{S(\infty)}$ depends on the maximal singular value of $X$, which is likely to be small if we assume that all the "energy" of $X$ is spread over its entire spectrum. In such cases, $\mathcal{W}_{S(1)}$ can be the best choice. This is an example of a different type of prior knowledge on the problem.

4.2 Batch multi-task learning 

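In the batch setting we see a dataset $T = ((X_1, y_1), \ldots, (X_n, y_n))$ consisting of i.i.d. samples drawn from a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. In the $k$-task setting, $\mathcal{X} \subseteq \mathbb{R}^{k \times d}$. Analogous to the single task case, we define the risk and empirical risk of a multitask predictor $W \in \mathbb{R}^{k \times d}$ as:

$$\hat{L}(W) := \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \ell^j\big(\langle w^j, X_i^j \rangle, y_i^j\big), \qquad L(W) := \mathbb{E}_{(X,y) \sim \mathcal{D}} \Big[ \sum_{j=1}^{k} \ell^j\big(\langle w^j, X^j \rangle, y^j\big) \Big].$$

Let $\mathcal{W}$ be some class of matrices, and define the empirical risk minimizer, $\hat{W} := \arg\min_{W \in \mathcal{W}} \hat{L}(W)$. To obtain excess risk bounds for $\hat{W}$, we need to consider the $k$-task Rademacher complexity

$$\mathcal{R}_T^k(\mathcal{W}) := \mathbb{E}\Big[ \sup_{W \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \epsilon_i^j \langle w^j, X_i^j \rangle \Big]$$

because, assuming each $\ell^j$ is $\rho$-Lipschitz, we have the bound $\mathbb{E}\big[ L(\hat{W}) - \min_{W \in \mathcal{W}} L(W) \big] \le \rho\, \mathbb{E}\big[ \mathcal{R}_T^k(\mathcal{W}) \big]$. This bound follows easily from Talagrand's contraction inequality and Thm. 8 in Maurer [2006]. We can use matrix strong convexity to give the following $k$-task Rademacher bound.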

Theorem 19 (Multitask Generalization) Suppose $F(W) \le f_{\max}$ for all $W \in \mathcal{W}$ for a function $F$ that is $\beta$-strongly convex w.r.t. some (matrix) norm $\|\cdot\|$. If the norm $\|\cdot\|_*$ is invariant under sign changes of the rows of its argument matrix then, for any dataset $T$, we have $\mathcal{R}_T^k(\mathcal{W}) \le X \sqrt{\frac{2 f_{\max}}{\beta n}}$, where $X$ is an upper bound on $\|X_i\|_*$.



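Proof: We can rewrite $\mathcal{R}_T^k(\mathcal{W})$ as

$$\mathbb{E}\Big[ \sup_{W \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \epsilon_i^j \langle w^j, X_i^j \rangle \Big] = \mathbb{E}\Big[ \sup_{W \in \mathcal{W}} \frac{1}{n} \Big\langle W, \sum_{i=1}^{n} \tilde{X}_i \Big\rangle \Big],$$

where $\tilde{X}_i \in \mathbb{R}^{k \times d}$ is defined by $\tilde{X}_i^j = \epsilon_i^j X_i^j$ and we have switched to a matrix inner product in the last expression. By the assumption on the dual norm $\|\cdot\|_*$, $\|\tilde{X}_i\|_* = \|X_i\|_* \le X$. Now using Corollary 4 and proceeding as in the proof of Theorem 6, we get, for any $\lambda > 0$, $\mathcal{R}_T^k(\mathcal{W}) \le \frac{f_{\max}}{\lambda n} + \frac{\lambda X^2}{2\beta}$. Optimizing over $\lambda$ proves the theorem.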
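For completeness, the optimization over $\lambda$ in the last step can be spelled out as a short worked calculation. It uses only the quantities $f_{\max}$, $\beta$, $X$ and $n$ from the statement of Theorem 19; the name $g$ for the upper bound is introduced here for convenience.

```latex
\[
g(\lambda) = \frac{f_{\max}}{\lambda n} + \frac{\lambda X^2}{2\beta},
\qquad
g'(\lambda) = -\frac{f_{\max}}{\lambda^2 n} + \frac{X^2}{2\beta} = 0
\;\Longrightarrow\;
\lambda^\star = \frac{1}{X}\sqrt{\frac{2\beta f_{\max}}{n}},
\]
\[
g(\lambda^\star)
= 2\sqrt{\frac{f_{\max}}{n}\cdot\frac{X^2}{2\beta}}
= X\sqrt{\frac{2 f_{\max}}{\beta n}} .
\]
```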



Note that both group $(r, p)$-norms and Schatten-$p$ norms satisfy the invariance under row flips mentioned in the theorem above. Thus, we get the following corollary.

Corollary 20 Let $\mathcal{W}_{1,1}, \mathcal{W}_{2,2}, \mathcal{W}_{2,1}, \mathcal{W}_{S(1)}$ be the classes defined in Section 3 and let $X_{\infty,\infty}, X_{2,2}, X_{2,\infty}, X_{S(\infty)}$ be the radii of $\mathcal{X}$ w.r.t. the corresponding norms. Then, the (expected) excess multitask risk of the empirical multitask risk minimizer $\hat{W}$ satisfies the same bounds given in Table 1.



5 Multi-class learning 

In this section we consider multi-class categorization problems. We focus on the online learning model. On round $t$, the online algorithm receives an instance $x_t \in \mathbb{R}^d$ and is required to predict its label as a number in $\{1, \ldots, k\}$. Following the construction of Crammer and Singer [2000], the prediction is based on a matrix $W_t \in \mathbb{R}^{k \times d}$ and is defined as the index of the maximal element of the vector $W_t x_t$. We use the hinge-loss function adapted to the multi-class setting. That is,

$$\ell_t(W_t) = \max_r \Big( \mathbb{1}_{[r \neq y_t]} - \big( \langle w_t^{y_t}, x_t \rangle - \langle w_t^{r}, x_t \rangle \big) \Big) = \max_r \Big( \mathbb{1}_{[r \neq y_t]} - \langle W_t, X_t^{r,y_t} \rangle \Big),$$

where $X_t^{r,y_t}$ is a matrix with $x_t$ on the $y_t$'th row, $-x_t$ on the $r$'th row, and zeros in all other elements. It is easy to verify that $\ell_t(W_t)$ upper bounds the zero-one loss, i.e. if the prediction of $W_t$ is $r$ then $\ell_t(W_t) \ge \mathbb{1}_{[r \neq y_t]}$.

A sub-gradient of $\ell_t(W_t)$ is either a matrix of the form $-X_t^{r,y_t}$ or the all zeros matrix. Note that each column of $X_t^{r,y_t}$ is very sparse (contains only two elements). Therefore,

$$\|X_t^{r,y_t}\|_{\infty,\infty} = \|x_t\|_\infty ; \quad \|X_t^{r,y_t}\|_{2,2} = \sqrt{2}\,\|x_t\|_2 ; \quad \|X_t^{r,y_t}\|_{2,\infty} = \sqrt{2}\,\|x_t\|_\infty ; \quad \|X_t^{r,y_t}\|_{S(\infty)} = \sqrt{2}\,\|x_t\|_2 .$$

Based on this fact, we can easily obtain the following.

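Corollary 21 Let $\mathcal{W}_{1,1}, \mathcal{W}_{2,2}, \mathcal{W}_{2,1}, \mathcal{W}_{S(1)}$ be the classes defined in Section 3 and let $X_2 = \max_t \|x_t\|_2$ and $X_\infty = \max_t \|x_t\|_\infty$. Then, there exist online multi-class learning algorithms with regret bounds given by the following table:

    class                     bound
    $\mathcal{W}_{1,1}$       $W_{1,1}\, X_\infty \sqrt{\ln(kd)/n}$
    $\mathcal{W}_{2,2}$       $W_{2,2}\, X_2 \sqrt{1/n}$
    $\mathcal{W}_{2,1}$       $W_{2,1}\, X_\infty \sqrt{\ln(d)/n}$
    $\mathcal{W}_{S(1)}$      $W_{S(1)}\, X_2 \sqrt{\ln(\min\{d,k\})/n}$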
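The norm identities for $X_t^{r,y_t}$ that feed into Corollary 21 can be confirmed numerically. In this sketch the helper `difference_matrix`, the sizes, and the choice of rows `r` and `y` are assumptions made for illustration.

```python
import numpy as np

def difference_matrix(x, r, y, k):
    """X^{r,y}: x on row y, -x on row r, zeros elsewhere (all zeros if r == y)."""
    M = np.zeros((k, x.shape[0]))
    M[y] += x
    M[r] -= x
    return M

rng = np.random.default_rng(0)
k, d = 6, 10  # illustrative sizes (assumed)
x = rng.standard_normal(d)
M = difference_matrix(x, r=1, y=4, k=k)

column_l2 = np.linalg.norm(M, axis=0)      # l_2 norm of every column
norm_inf_inf = np.abs(M).max()             # ||M||_{inf,inf}
norm_2_2 = np.linalg.norm(M)               # ||M||_{2,2} (Frobenius)
norm_2_inf = column_l2.max()               # ||M||_{2,inf}
norm_spectral = np.linalg.norm(M, ord=2)   # ||M||_{S(inf)}
```

Each value matches the corresponding closed form in terms of $\|x_t\|_2$ and $\|x_t\|_\infty$ when $r \neq y_t$.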



Let us now discuss the implications of this bound. First, if $X_2 \approx X_\infty$, which will happen if instance vectors are sparse, then $\mathcal{W}_{1,1}$ and $\mathcal{W}_{2,1}$ will be inferior to $\mathcal{W}_{2,2}$. In such a case, using $\mathcal{W}_{S(1)}$ can be even better if $W$ sits in a low dimensional space but each row of $W$ still has a unit norm. Using $\mathcal{W}_{S(1)}$ in such a case was previously suggested by Amit et al. [2007], who observed that empirically, the class $\mathcal{W}_{S(1)}$ performs better than $\mathcal{W}_{2,2}$ when there is a shared structure between classes. The analysis given in Corollary 21 provides a first rigorous explanation of such behavior.

Second, if $X_2$ is much larger than $X_\infty$, and if columns of $W$ share a common sparsity pattern, then $\mathcal{W}_{2,1}$ can be a factor of $\sqrt{k}$ better than $\mathcal{W}_{1,1}$ and a factor of $\sqrt{d}$ better than $\mathcal{W}_{2,2}$. To demonstrate this, let us assume that each vector $x_t$ is in $\{\pm 1\}^d$ and that it represents the advice of $d$ experts. Therefore, $X_2 = \sqrt{d}\, X_\infty$. Next, assume that a combination of the advice of $s \ll d$ experts predicts the correct label very well (e.g., the label is represented by the binary number obtained from the advice of $s = \log(k)$ experts). In that case, $W$ will be a matrix such that all of its columns will be $0$ except $s$ columns which will take values in $\{\pm 1\}$. The bounds for $\mathcal{W}_{1,1}$, $\mathcal{W}_{2,2}$, and $\mathcal{W}_{2,1}$ in that case become $k s \sqrt{\ln(kd)}$, $\sqrt{k s d}$, and $\sqrt{k}\, s \sqrt{\ln(d)}$ respectively. That is, $\mathcal{W}_{2,1}$ is a factor of $\sqrt{k}$ better than $\mathcal{W}_{1,1}$ and a factor of $\sqrt{d}$ better than $\mathcal{W}_{2,2}$ (ignoring logarithmic terms). The class $\mathcal{W}_{S(1)}$ will also have a dependence on $\sqrt{d}$ in such a case and thus it will be much worse than $\mathcal{W}_{2,1}$ when $d$ is large.

For concreteness, we now utilize our result for deriving a group Multi-class Perceptron algorithm. To the 
best of our knowledge, this algorithm is new, and based on the discussion above, it should outperform both 
the multi-class Perceptron of Crammer and Singer [2000] as well as the vanilla application of the p-norm 
Perceptron framework of Gentile [2003], Grove et al. [2001] for multi-class categorization. 

The algorithm is a specialization of the general online mirror descent procedure (Algorithm 1) with $f(W) = \frac{1}{2}\|W\|_{2,r}^2$, $r = \log(d)/(\log(d) - 1)$, and with a conservative update (i.e., we ignore rounds on which no prediction mistake has been made). Recall that the Fenchel dual function is $f^*(V) = \frac{1}{2}\|V\|_{2,p}^2$ where $p = (1 - 1/r)^{-1} = \log(d)$. The $(r, j)$ element of the gradient of $f^*$ is

$$\big(\nabla f^*(V)\big)_{r,j} = \frac{\|V^j\|_2^{\,p-2}}{\|V\|_{2,p}^{\,p-2}}\, V_{r,j} . \qquad (2)$$



Algorithm 2 Group Multi-class Perceptron

    p = log d
    V_1 = 0 ∈ R^{k×d}
    for t = 1, ..., T do
        Set W_t = ∇f*(V_t) (as defined in Eq. (2))
        Receive x_t ∈ R^d
        ŷ_t = argmax_{r ∈ [k]} (W_t x_t)_r
        Predict ŷ_t and receive true label y_t
        U_t ∈ R^{k×d} is the matrix with x_t in the ŷ_t row and −x_t in the y_t row
        Update: V_{t+1} = V_t − U_t
    end for
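A minimal Python sketch of Algorithm 2 follows. It is an illustration under stated assumptions rather than a reference implementation: the function names, the synthetic $\{\pm 1\}$ "expert advice" data, and the labelling rule are all hypothetical, and the exponent is clamped to $p \ge 2$ (an assumption added here so that the power in Eq. (2) stays well defined for very small $d$).

```python
import numpy as np

def grad_f_star(V, p):
    """Gradient of f*(V) = 0.5 * ||V||_{2,p}^2, following Eq. (2)."""
    column_norms = np.linalg.norm(V, axis=0)        # ||V^j||_2 for each column j
    total = np.linalg.norm(column_norms, ord=p)     # ||V||_{2,p}
    if total == 0.0:
        return np.zeros_like(V)
    return V * (column_norms / total) ** (p - 2.0)  # scales column j of V

def group_multiclass_perceptron(X, y, k):
    """Run one pass of the conservative update; return (V, number of mistakes)."""
    n, d = X.shape
    p = max(2.0, np.log(d))     # p = log d, clamped at 2 (assumption, see lead-in)
    V = np.zeros((k, d))
    mistakes = 0
    for x_t, y_t in zip(X, y):
        W = grad_f_star(V, p)
        y_hat = int(np.argmax(W @ x_t))
        if y_hat != y_t:        # conservative: update only on mistakes
            mistakes += 1
            V[y_t] += x_t       # V_{t+1} = V_t - U_t, with U_t having
            V[y_hat] -= x_t     # x_t on row y_hat and -x_t on row y_t
    return V, mistakes

# Synthetic data (assumed): d experts give +/-1 advice and the label is the
# binary number spelled by the first two experts, so k = 4.
rng = np.random.default_rng(0)
n, d, k = 500, 20, 4
X = rng.choice([-1.0, 1.0], size=(n, d))
y = (X[:, 0] > 0).astype(int) * 2 + (X[:, 1] > 0).astype(int)
V, mistakes = group_multiclass_perceptron(X, y, k)
```

In this toy problem only two columns of a good predictor are non-zero, which is the group-sparse regime in which the $\mathcal{W}_{2,1}$ analysis is expected to be favorable.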



To analyze the performance of Algorithm 2, let $I \subseteq [n]$ be the set of rounds on which the algorithm made a prediction mistake. Note that the above algorithm is equivalent (in terms of the number of mistakes) to an algorithm that performs the update $V_{t+1} = V_t + \eta X_t^{\hat{y}_t, y_t}$ for any $\eta > 0$ (see Gentile [2003]); here $X_t^{\hat{y}_t, y_t} = -U_t$. Therefore, we can apply our general online regret bound (Corollary 16) on the sequence of examples in $I$, and we obtain that for any $W$

$$\sum_{t \in I} \ell_t(W_t) - \sum_{t \in I} \ell_t(W) \le O\Big( X_\infty\, \|W\|_{2,1} \sqrt{\log(d)\, |I|} \Big).$$

Recall that $\ell_t(W_t)$ upper bounds the zero-one error and therefore the above implies that

$$|I| - \sum_{t \in I} \ell_t(W) \le O\Big( X_\infty\, \|W\|_{2,1} \sqrt{\log(d)\, |I|} \Big).$$

Solving for $|I|$ we conclude that:

Corollary 22 The number of mistakes Algorithm 2 will make on any sequence of examples for which $\|x_t\|_\infty \le X_\infty$ is upper bounded by

$$\min_W \; \sum_t \ell_t(W) + O\Big( X_\infty\, \|W\|_{2,1} \sqrt{\log(d) \sum_t \ell_t(W)} + X_\infty^2\, \|W\|_{2,1}^2 \log(d) \Big).$$

6 Kernel learning 

We briefly review the kernel learning setting first explored in Lanckriet et al. [2004]. Let $\mathcal{X}$ be an input space and let $T = (x_1, \ldots, x_n) \in \mathcal{X}^n$ be the training dataset. Kernel algorithms work with the space of linear functions, $\{x \mapsto \sum_{i=1}^{n} \alpha_i K(x_i, x) : \alpha_i \in \mathbb{R}\}$. In kernel learning, we consider a kernel family $\mathcal{K}$ and consider the class, $\{x \mapsto \sum_{i=1}^{n} \alpha_i K(x_i, x) : K \in \mathcal{K},\ \alpha_i \in \mathbb{R}\}$. In particular, we can choose a finite set $\{K_1, \ldots, K_k\}$ of base kernels and consider the convex combinations, $\mathcal{K}_c^+ = \big\{ \sum_{j=1}^{k} \mu_j K_j : \mu_j \ge 0,\ \sum_{j=1}^{k} \mu_j = 1 \big\}$. This is the unconstrained function class. In applications, one constrains the function class in some way. The class considered in Lanckriet et al. [2004] is

$$\mathcal{F}_{\mathcal{K}_c^+} = \Big\{ x \mapsto \sum_{i=1}^{n} \alpha_i K(x_i, \cdot) \;:\; K = \sum_{j=1}^{k} \mu_j K_j,\ \mu_j \ge 0,\ \sum_{j=1}^{k} \mu_j = 1,\ \alpha^\top K(T)\, \alpha \le 1/\gamma^2 \Big\}, \qquad (3)$$

where $\gamma > 0$ is a margin parameter and $K(T)_{i,j} = K(x_i, x_j)$ is the Gram matrix of $K$ on the dataset $T$.
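The ingredients of the class in Eq. (3) can be sketched in a few lines. The choice of Gaussian base kernels with three widths, the weights `mu`, the margin `gamma`, and all helper names below are assumptions made purely for illustration; the text does not prescribe any particular base kernels.

```python
import numpy as np

def rbf_gram(points, width):
    """Gram matrix of the Gaussian kernel exp(-||x - x'||^2 / (2 width^2))."""
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * width ** 2))

def combine_grams(grams, mu):
    """Gram matrix of K = sum_j mu_j K_j for mu on the probability simplex."""
    mu = np.asarray(mu, dtype=float)
    if np.any(mu < 0) or not np.isclose(mu.sum(), 1.0):
        raise ValueError("mu must be non-negative and sum to one")
    return np.tensordot(mu, np.stack(grams), axes=1)

def satisfies_margin_constraint(alpha, gram, gamma):
    """Check the constraint alpha^T K(T) alpha <= 1 / gamma^2 from Eq. (3)."""
    return float(alpha @ gram @ alpha) <= 1.0 / gamma ** 2

rng = np.random.default_rng(0)
points = rng.standard_normal((15, 3))                 # toy dataset T (assumed)
grams = [rbf_gram(points, width) for width in (0.5, 1.0, 2.0)]
K = combine_grams(grams, mu=[0.2, 0.5, 0.3])

gamma = 0.5
alpha = rng.standard_normal(15)
alpha *= 0.99 / (gamma * np.sqrt(alpha @ K @ alpha))  # rescale into the class
```

For Gaussian base kernels $K_j(x, x) = 1$, so the constant $B$ bounding $K_j(x,x)$ would be $1$ in this toy setting.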

Theorem 23 (Kernel learning) Consider the class $\mathcal{F}_{\mathcal{K}_c^+}$ defined in Eq. (3). Let $K_j(x, x) \le B$ for $1 \le j \le k$ and $x \in \mathcal{X}$. Then, $\mathcal{R}_T(\mathcal{F}_{\mathcal{K}_c^+}) \le e \sqrt{\frac{B \log k}{\gamma^2 n}}$.

The proof follows directly from the equivalence between kernel learning and group Lasso [Bach, 2008], and then applying our bound on the class $\mathcal{W}_{2,1}$. For completeness, we give a rigorous proof in the appendix.

Note that the dependence on the number of base kernels, $k$, is rather mild (only logarithmic), implying that we can learn a kernel as a (convex) combination of a rather large number of base kernels. Also, let us discuss how the above improves upon the prior bounds provided by Lanckriet et al. [2004] and Srebro and Ben-David [2006] (neither of which had logarithmic $k$ dependence). The former proves a bound of $O\Big( \sqrt{\frac{Bk}{\gamma^2 n}} \Big)$, which is quite inferior to our bound. We cannot compare our bound directly to the bound in Srebro and Ben-David [2006] as they do not work with Rademacher complexities. However, if one compares the resulting generalization error bounds, then their bound is $O\Bigg( \sqrt{\frac{k \log\frac{n^3 B}{\gamma^2 k} + \frac{B}{\gamma^2} \log\frac{\gamma n}{\sqrt{B}} \log\frac{n B}{\gamma^2}}{n}} \Bigg)$ and ours is $O\Big( \sqrt{\frac{B \log k}{\gamma^2 n}} \Big)$. If $k \ge n$, their bound is vacuous (while ours is still meaningful). If $k \le n$, our bound is better.

Finally, we note that recently Ying and Campbell [2009] devoted a dedicated effort to derive a result similar to Theorem 23 using a Rademacher chaos process of order two over candidate kernels. In contrast to their proof, our result seamlessly follows from the general framework of deriving bounds using the strong-convexity/strong-smoothness duality.

Acknowledgements

We thank Andreas Argyriou, Shmuel Friedland & Karthik Sridharan for helpful discussions.



References 



Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques for online multitask learning. Technical report, EECS Department, University of California, Berkeley, 2008.

Yonatan Amit, Michael Fink, Nathan Srebro, and Shimon Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, 2007.

Francis Bach. Consistency of the group lasso and multiple kernel learning. JMLR, 9, 2008.

Keith Ball, Eric A. Carlen, and Elliott H. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Invent. Math., 115:463-482, 1994.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

R. Bhatia. Matrix Analysis. Springer, 1997.

J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.

G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Linear algorithms for online multitask classification. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, pages 251-262, 2008.

N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000.

C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3):265-299, 2003.

A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3):173-210, 2001.

A. Juditsky and A. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. Submitted to Annals of Probability, 2008.

S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems 22, 2008.

J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3):301-329, July 2001.

J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-64, January 1997.

G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.

A. S. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2(2):173-183, 1995.

N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.

Karim Lounici, Massimiliano Pontil, Alexandre B. Tsybakov, and Sara van de Geer. Taking advantage of sparsity in multi-task learning. arXiv:0903.1468, Mar 2009.

Andreas Maurer. Bounds for linear multi-task learning. Journal of Machine Learning Research, 2006.

R. Meir and T. Zhang. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4:839-860, 2003.

A. Y. Ng. Feature selection, $\ell_1$ vs. $\ell_2$ regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.

G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection for grouped classification. Technical Report 743, Dept. of Statistics, University of California Berkeley, 2007.

I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. Ann. Probab., 22(4):1679-1706, 1994.

G. Pisier. Martingales with values in uniformly convex spaces. Israel Journal of Mathematics, 20(3-4):326-350, 1975.

R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.

S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 20, 2006.

S. Shalev-Shwartz and Y. Singer. A primal-dual perspective of online learning algorithms. Machine Learning Journal, 2007.

S. Shalev-Shwartz and Y. Singer. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008.

N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, pages 169-183, 2006.

M. Warmuth and D. Kuzmin. Online variance minimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.

X. Yang, S. Kim, and E. P. Xing. Heterogeneous multitask learning with joint sparsity constraints. In Advances in Neural Information Processing Systems 23, 2009.

Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In COLT, 2009.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68(1):49-67, 2006.

C. Zalinescu. Convex analysis in general vector spaces. World Scientific Publishing Co. Inc., River Edge, NJ, 2002.



A Convex Analysis and Matrix Computation 

A.1 Convex analysis

We briefly recall some key definitions from convex analysis that are useful throughout the paper (for details, see any of the several excellent references on the subject, e.g. Borwein and Lewis [2006], Rockafellar [1970]). We consider convex functions $f : \mathcal{X} \to \mathbb{R} \cup \{\infty\}$, where $\mathcal{X}$ is a Euclidean vector space equipped with an inner product $\langle \cdot, \cdot \rangle$. We denote $\mathbb{R}^* = \mathbb{R} \cup \{\infty\}$. Recall that the subdifferential of $f$ at $x \in \mathcal{X}$, denoted by $\partial f(x)$, is defined as $\partial f(x) := \{y \in \mathcal{X} : \forall z,\ f(x + z) \ge f(x) + \langle y, z \rangle\}$. The Fenchel conjugate $f^* : \mathcal{X} \to \mathbb{R}^*$ is defined as $f^*(y) := \sup_{x \in \mathcal{X}} \langle x, y \rangle - f(x)$.

We also deal with a variety of norms in this paper. Recall that given a norm $\|\cdot\|$ on $\mathcal{X}$, its dual norm is defined as $\|y\|_* := \sup\{\langle x, y \rangle : \|x\| \le 1\}$. An important property of the dual norm is that the Fenchel conjugate of the function $\frac{1}{2}\|x\|^2$ is $\frac{1}{2}\|y\|_*^2$.

The definition of Fenchel conjugate implies that for any $x, y$, $f(x) + f^*(y) \ge \langle x, y \rangle$, which is known as the Fenchel-Young inequality. An equivalent and useful definition of the subdifferential can be given in terms of the Fenchel conjugate: $\partial f(x) = \{y \in \mathcal{X} : f(x) + f^*(y) = \langle x, y \rangle\}$.
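The Fenchel-Young inequality and its equality case are easy to check numerically for the pair $f(x) = \frac{1}{2}\|x\|_p^2$ and $f^*(y) = \frac{1}{2}\|y\|_q^2$ with $1/p + 1/q = 1$. The sketch below is illustrative only; the exponent $p = 3$, the dimension, and the random test points are assumptions.

```python
import numpy as np

p = 3.0
q = p / (p - 1.0)  # dual exponent, 1/p + 1/q = 1

def f(x):
    """f(x) = 0.5 * ||x||_p^2."""
    return 0.5 * np.linalg.norm(x, ord=p) ** 2

def f_conj(y):
    """Fenchel conjugate of f: 0.5 * ||y||_q^2 (half squared dual norm)."""
    return 0.5 * np.linalg.norm(y, ord=q) ** 2

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    gaps.append(f(x) + f_conj(y) - x @ y)
smallest_gap = min(gaps)  # Fenchel-Young says this is never negative

# Equality holds when y is the gradient of f at x, i.e. y is in the subdifferential.
x = rng.standard_normal(5)
y_star = np.linalg.norm(x, ord=p) ** (2.0 - p) * np.sign(x) * np.abs(x) ** (p - 1.0)
equality_gap = f(x) + f_conj(y_star) - x @ y_star
```

The second check illustrates the characterization $\partial f(x) = \{y : f(x) + f^*(y) = \langle x, y\rangle\}$ at a point where $f$ is differentiable.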

A.2 Convex analysis of matrix functions 

We consider the vector space $\mathcal{X} = \mathbb{R}^{m \times n}$ of real matrices of size $m \times n$ and the vector space $\mathcal{X} = \mathbb{S}^n$ of symmetric matrices of size $n \times n$, both equipped with the inner product, $\langle X, Y \rangle := \mathrm{Tr}(X^\top Y)$. Recall that any matrix $X \in \mathbb{R}^{m \times n}$ can be decomposed as $X = U\, \mathrm{Diag}(\sigma(X))\, V$ where $\sigma(X)$ denotes the vector $(\sigma_1, \sigma_2, \ldots, \sigma_l)$ ($l = \min\{m, n\}$), where $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_l \ge 0$ are the singular values of $X$ arranged in non-increasing order, and $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices. Also, any matrix $X \in \mathbb{S}^n$ can be decomposed as $X = U\, \mathrm{Diag}(\lambda(X))\, U^\top$ where $\lambda(X) = (\lambda_1, \lambda_2, \ldots, \lambda_n)$, where $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$ are the eigenvalues of $X$ arranged in non-increasing order, and $U$ is an orthogonal matrix. Two important results relate matrix inner products to inner products between singular (and eigen-) values.

Theorem 24 (von Neumann) Any two matrices $X, Y \in \mathbb{R}^{m \times n}$ satisfy the inequality

$$\langle X, Y \rangle \le \langle \sigma(X), \sigma(Y) \rangle .$$

Equality holds above if and only if there exist orthogonal $U, V$ such that

$$X = U\, \mathrm{Diag}(\sigma(X))\, V, \qquad Y = U\, \mathrm{Diag}(\sigma(Y))\, V .$$

Theorem 25 (Fan) Any two matrices $X, Y \in \mathbb{S}^n$ satisfy the inequality

$$\langle X, Y \rangle \le \langle \lambda(X), \lambda(Y) \rangle .$$

Equality holds above if and only if there exists orthogonal $U$ such that

$$X = U\, \mathrm{Diag}(\lambda(X))\, U^\top, \qquad Y = U\, \mathrm{Diag}(\lambda(Y))\, U^\top .$$
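Both inequalities can be sanity-checked on random matrices. The following sketch is only an illustration; the sizes, the seed, and the particular symmetric matrices `A` and `B` are assumptions.

```python
import numpy as np

def inner(A, B):
    """Matrix inner product <A, B> = Tr(A^T B)."""
    return np.trace(A.T @ B)

rng = np.random.default_rng(0)
m, n = 4, 6  # illustrative sizes (assumed)
X, Y = rng.standard_normal((m, n)), rng.standard_normal((m, n))

# von Neumann: <X, Y> <= <sigma(X), sigma(Y)>.
sigma_X = np.linalg.svd(X, compute_uv=False)  # non-increasing order
sigma_Y = np.linalg.svd(Y, compute_uv=False)
von_neumann_gap = sigma_X @ sigma_Y - inner(X, Y)

# Equality case: give Y the singular vectors of X.
U, _, Vt = np.linalg.svd(X, full_matrices=False)
Y_aligned = U @ np.diag(sigma_Y) @ Vt
aligned_gap = sigma_X @ sigma_Y - inner(X, Y_aligned)

# Fan: <A, B> <= <lambda(A), lambda(B)> for symmetric A, B.
A, B = X @ X.T, Y @ Y.T - np.eye(m)
lam_A = np.linalg.eigvalsh(A)  # ascending order; the same ordering is used for
lam_B = np.linalg.eigvalsh(B)  # both vectors, so the pairing matches the theorem
fan_gap = lam_A @ lam_B - inner(A, B)
```

Both gaps are non-negative, and the aligned construction attains equality in von Neumann's inequality.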

We say that a function $g : \mathbb{R}^n \to \mathbb{R}^*$ is symmetric if $g(x)$ is invariant under arbitrary permutations of the components of $x$. We say $g$ is absolutely symmetric if $g(x)$ is invariant under arbitrary permutations and sign changes of the components of $x$.

Given a function $f : \mathbb{R}^l \to \mathbb{R}^*$, we can define a function $f \circ \sigma : \mathbb{R}^{m \times n} \to \mathbb{R}^*$ as,

$$(f \circ \sigma)(X) := f(\sigma(X)) .$$

Similarly, given a function $g : \mathbb{R}^n \to \mathbb{R}^*$, we can define a function $g \circ \lambda : \mathbb{S}^n \to \mathbb{R}^*$ as,

$$(g \circ \lambda)(X) := g(\lambda(X)) .$$

This allows us to define functions over matrices starting from functions over vectors. Note that when we use $f \circ \sigma$ we are assuming that $\mathcal{X} = \mathbb{R}^{m \times n}$ and for $g \circ \lambda$ we have $\mathcal{X} = \mathbb{S}^n$. The following result allows us to immediately compute the conjugate of $f \circ \sigma$ and $g \circ \lambda$ in terms of the conjugates of $f$ and $g$ respectively.

Theorem 26 (Lewis [1995]) Let $f : \mathbb{R}^l \to \mathbb{R}^*$ be an absolutely symmetric function. Then,

$$(f \circ \sigma)^* = f^* \circ \sigma .$$

Let $g : \mathbb{R}^n \to \mathbb{R}^*$ be a symmetric function. Then,

$$(g \circ \lambda)^* = g^* \circ \lambda .$$

Proof: Lewis [1995] proves this for singular values. For the eigenvalue case, the proof is entirely analogous to that in Lewis [1995], except that Fan's inequality is used instead of von Neumann's inequality. ■



Using this general result, we are able to define certain matrix norms.

Corollary 27 (Matrix norms) Let $f : \mathbb{R}^l \to \mathbb{R}^*$ be absolutely symmetric. Then if $f = \|\cdot\|$ is a norm on $\mathbb{R}^l$ then $f \circ \sigma = \|\sigma(\cdot)\|$ is a norm on $\mathbb{R}^{m \times n}$. Further, the dual of this norm is $\|\sigma(\cdot)\|_*$.

Let $g : \mathbb{R}^n \to \mathbb{R}^*$ be symmetric. Then if $g = \|\cdot\|$ is a norm on $\mathbb{R}^n$ then $g \circ \lambda = \|\lambda(\cdot)\|$ is a norm on $\mathbb{S}^n$. Further, the dual of this norm is $\|\lambda(\cdot)\|_*$.

Another nice result allows us to compute subdifferentials of $f \circ \sigma$ and $g \circ \lambda$ (note that elements in the subdifferential of $f \circ \sigma$ and $g \circ \lambda$ are matrices) from the subdifferentials of $f$ and $g$ respectively.

Theorem 28 (Lewis [1995]) Let $f : \mathbb{R}^l \to \mathbb{R}^*$ be absolutely symmetric and $X \in \mathbb{R}^{m \times n}$. Then,

$$\partial (f \circ \sigma)(X) = \big\{ U\, \mathrm{Diag}(\mu)\, V^\top : \mu \in \partial f(\sigma(X)),\ U, V \text{ orthogonal},\ X = U\, \mathrm{Diag}(\sigma(X))\, V^\top \big\} .$$

Let $g : \mathbb{R}^n \to \mathbb{R}^*$ be symmetric and $X \in \mathbb{S}^n$. Then,

$$\partial (g \circ \lambda)(X) = \big\{ U\, \mathrm{Diag}(\mu)\, U^\top : \mu \in \partial g(\lambda(X)),\ U \text{ orthogonal},\ X = U\, \mathrm{Diag}(\lambda(X))\, U^\top \big\} .$$

Proof: Again, Lewis [1995] proves the case for singular values. For the eigenvalue case, again, the proof is identical to that in Lewis [1995], except that Fan's inequality is used instead of von Neumann's inequality. ■
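As an illustration of Theorem 28, take $f = \|\cdot\|_1$, so that $f \circ \sigma$ is the trace norm $\|\cdot\|_{S(1)}$. At a matrix $X$ whose singular values are all positive, $\partial f(\sigma(X)) = \{(1, \ldots, 1)\}$ and the formula gives the sub-gradient $U V^\top$. The sketch below checks the sub-gradient inequality numerically; the sizes, the seed, and the use of the thin SVD are assumptions of the example.

```python
import numpy as np

def trace_norm(A):
    """||A||_{S(1)}: the sum of the singular values of A."""
    return np.linalg.norm(A, ord="nuc")

rng = np.random.default_rng(0)
m, n = 4, 6  # illustrative sizes (assumed)
X = rng.standard_normal((m, n))  # full rank with probability one

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
G = U @ Vt  # U Diag(mu) V^T with mu = (1, ..., 1)

# Sub-gradient inequality ||Z||_{S(1)} >= ||X||_{S(1)} + <G, Z - X> at random Z.
slacks = []
for _ in range(100):
    Z = rng.standard_normal((m, n))
    slacks.append(trace_norm(Z) - trace_norm(X) - np.sum(G * (Z - X)))
min_slack = min(slacks)
```

Note also that $\langle G, X \rangle = \|X\|_{S(1)}$ and $\|G\|_{S(\infty)} = 1$, consistent with the duality between $\|\cdot\|_{S(1)}$ and $\|\cdot\|_{S(\infty)}$ from Corollary 27.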

B Technical Proofs 
B.1 Proof of Theorem 3

First, [Shalev-Shwartz, 2007, Lemma 15] yields one half of the claim ($f$ strongly convex $\Rightarrow$ $f^*$ strongly smooth). It is left to prove that $f$ is strongly convex assuming that $f^*$ is strongly smooth. For simplicity assume that $\beta = 1$. Denote $g(y) = f^*(x + y) - \big( f^*(x) + \langle \nabla f^*(x), y \rangle \big)$. By the smoothness assumption, $g(y) \le \frac{1}{2}\|y\|_*^2$. This implies that $g^*(a) \ge \frac{1}{2}\|a\|^2$ because of [Shalev-Shwartz and Singer, 2008, Lemma 19] and the fact that the conjugate of half squared norm is half squared of the dual norm. Using the definition of $g$ we have

$$\begin{aligned}
g^*(a) &= \sup_y\ \langle y, a \rangle - g(y) \\
&= \sup_y\ \langle y, a \rangle - \big( f^*(x + y) - ( f^*(x) + \langle \nabla f^*(x), y \rangle ) \big) \\
&= \sup_y\ \langle y, a + \nabla f^*(x) \rangle - f^*(x + y) + f^*(x) \\
&= \sup_z\ \langle z - x, a + \nabla f^*(x) \rangle - f^*(z) + f^*(x) \\
&= f(a + \nabla f^*(x)) + f^*(x) - \langle x, a + \nabla f^*(x) \rangle
\end{aligned}$$

where we have used that $f^{**} = f$ in the last step. Denote $u = \nabla f^*(x)$. From the equality in Fenchel-Young (e.g. [Shalev-Shwartz and Singer, 2008, Lemma 17]) we obtain that $\langle x, u \rangle = f^*(x) + f(u)$ and thus

$$g^*(a) = f(a + u) - f(u) - \langle x, a \rangle .$$

Combining with $g^*(a) \ge \frac{1}{2}\|a\|^2$, we have

$$f(a + u) - f(u) - \langle x, a \rangle \ge \tfrac{1}{2}\|a\|^2 , \qquad (4)$$

which holds for all $a, x$, with $u = \nabla f^*(x)$.

Now let us prove that for any point $u'$ in the relative interior of the domain of $f$, if $x \in \partial f(u')$ then $u' = \nabla f^*(x)$. Let $u := \nabla f^*(x)$; we must show that $u' = u$. By Fenchel-Young, we have that $\langle x, u' \rangle = f^*(x) + f(u')$, and, again by Fenchel-Young (and $f^{**} = f$), we have $\langle x, u \rangle = f^*(x) + f(u)$. We can now apply Eq. (4) with $a = u' - u$ to obtain:

$$\begin{aligned}
0 &= \langle x, u \rangle - f(u) - \big( \langle x, u' \rangle - f(u') \big) \\
&= f(u') - f(u) - \langle x, u' - u \rangle \ \ge\ \tfrac{1}{2}\|u' - u\|^2 ,
\end{aligned}$$

which implies that $u' = \nabla f^*(x)$.

Next, let $u_1, u_2$ be two points in the relative interior of the domain of $f$, let $\alpha \in (0, 1)$, and let $u = \alpha u_1 + (1 - \alpha) u_2$. Let $x \in \partial f(u)$ (which is non-empty; see the footnote below). We have that $u = \nabla f^*(x)$, by the previous argument. Now we are able to apply Eq. (4) twice, once with $a = u_1 - u$ and once with $a = u_2 - u$ (and both with $x$) to obtain

$$f(u_1) - f(u) - \langle x, u_1 - u \rangle \ge \tfrac{1}{2}\|u_1 - u\|^2$$

$$f(u_2) - f(u) - \langle x, u_2 - u \rangle \ge \tfrac{1}{2}\|u_2 - u\|^2$$

Finally, summing up the above two equations with coefficients $\alpha$ and $1 - \alpha$ we obtain that $f$ is strongly convex.

Footnote: The set $\partial f(u)$ is not empty for all $u$ in the relative interior of the domain of $f$. See the relative max formula in [Borwein and Lewis, 2006, page 42] or [Rockafellar, 1970, page 253]. If $u$ is not in the relative interior of the domain of $f$, then $\partial f(u)$ may be empty. But, a function is defined to be essentially strictly convex if it is strictly convex on any subset of $\{u : \partial f(u) \neq \emptyset\}$. The last set is called the domain of $\partial f$ and it contains the relative interior of the domain of $f$, so the argument applies here.
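The final summation step can be written out explicitly (a short worked calculation with $\beta = 1$, using only the definition $u = \alpha u_1 + (1-\alpha) u_2$):

```latex
Since $u_1 - u = (1-\alpha)(u_1 - u_2)$ and $u_2 - u = -\alpha\,(u_1 - u_2)$, the linear
terms cancel:
$\alpha \langle x, u_1 - u \rangle + (1-\alpha) \langle x, u_2 - u \rangle
 = \langle x, \alpha u_1 + (1-\alpha) u_2 - u \rangle = 0$. Hence
\[
\alpha f(u_1) + (1-\alpha) f(u_2) - f(u)
\;\ge\; \tfrac12 \bigl( \alpha (1-\alpha)^2 + (1-\alpha)\,\alpha^2 \bigr) \|u_1 - u_2\|^2
\;=\; \tfrac12\, \alpha (1-\alpha)\, \|u_1 - u_2\|^2 ,
\]
which is the definition of $1$-strong convexity of $f$ with respect to $\|\cdot\|$.
```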

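B.2 Proof of Theorem 12

Note that an equivalent definition of $\sigma$-smoothness of a function $f$ w.r.t. a norm $\|\cdot\|$ is that, for all $x, y$ and $\alpha \in [0, 1]$, we have

$$f(\alpha x + (1 - \alpha) y) \ge \alpha f(x) + (1 - \alpha) f(y) - \tfrac{1}{2} \sigma \alpha (1 - \alpha) \|x - y\|^2 .$$

Let $X, Y \in \mathbb{R}^{m \times n}$ be arbitrary matrices with columns $X^i$ and $Y^i$ respectively. We need to prove

$$\|\alpha X + (1 - \alpha) Y\|_{\Psi,\Phi}^2 \ge \alpha \|X\|_{\Psi,\Phi}^2 + (1 - \alpha) \|Y\|_{\Psi,\Phi}^2 - \tfrac{1}{2} (\sigma_1 + \sigma_2)\, \alpha (1 - \alpha)\, \|X - Y\|_{\Psi,\Phi}^2 . \qquad (5)$$

Using smoothness of $\Psi^2$ and that $\Phi$ is a Q-norm, we have,

$$\begin{aligned}
\|\alpha X + (1 - \alpha) Y\|_{\Psi,\Phi}^2
&= (\Phi^2 \circ \sqrt{\cdot})\big( \ldots, \Psi^2(\alpha X^i + (1 - \alpha) Y^i), \ldots \big) \\
&\ge (\Phi^2 \circ \sqrt{\cdot})\big( \ldots, \alpha \Psi^2(X^i) + (1 - \alpha) \Psi^2(Y^i) - \tfrac{1}{2} \sigma_1 \alpha (1 - \alpha) \Psi^2(X^i - Y^i), \ldots \big) \\
&\ge (\Phi^2 \circ \sqrt{\cdot})\big( \ldots, \alpha \Psi^2(X^i) + (1 - \alpha) \Psi^2(Y^i), \ldots \big) \\
&\qquad - \tfrac{1}{2} \sigma_1 \alpha (1 - \alpha)\, (\Phi^2 \circ \sqrt{\cdot})\big( \ldots, \Psi^2(X^i - Y^i), \ldots \big) \\
&= \Phi^2\Big( \ldots, \sqrt{\alpha \Psi^2(X^i) + (1 - \alpha) \Psi^2(Y^i)}, \ldots \Big) - \tfrac{1}{2} \sigma_1 \alpha (1 - \alpha)\, \|X - Y\|_{\Psi,\Phi}^2 . \qquad (6)
\end{aligned}$$

Now, we use that, for any $x, y \ge 0$ and $\alpha \in [0, 1]$, we have $\sqrt{\alpha x^2 + (1 - \alpha) y^2} \ge \alpha x + (1 - \alpha) y$. Thus, we have

$$\begin{aligned}
\Phi^2\Big( \ldots, \sqrt{\alpha \Psi^2(X^i) + (1 - \alpha) \Psi^2(Y^i)}, \ldots \Big)
&\ge \Phi^2\big( \ldots, \alpha \Psi(X^i) + (1 - \alpha) \Psi(Y^i), \ldots \big) \\
&\ge \alpha\, \Phi^2\big( \ldots, \Psi(X^i), \ldots \big) + (1 - \alpha)\, \Phi^2\big( \ldots, \Psi(Y^i), \ldots \big) \\
&\qquad - \tfrac{1}{2} \sigma_2 \alpha (1 - \alpha)\, \Phi^2\big( \ldots, \Psi(X^i) - \Psi(Y^i), \ldots \big) \\
&\ge \alpha \|X\|_{\Psi,\Phi}^2 + (1 - \alpha) \|Y\|_{\Psi,\Phi}^2 - \tfrac{1}{2} \sigma_2 \alpha (1 - \alpha)\, \Phi^2\big( \ldots, \Psi(X^i - Y^i), \ldots \big) \\
&= \alpha \|X\|_{\Psi,\Phi}^2 + (1 - \alpha) \|Y\|_{\Psi,\Phi}^2 - \tfrac{1}{2} \sigma_2 \alpha (1 - \alpha)\, \|X - Y\|_{\Psi,\Phi}^2 .
\end{aligned}$$

Plugging this into Eq. (6) proves Eq. (5).
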
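B.3 Proof of Theorem 23

Let $\mathcal{H}_j$ be the RKHS of $K_j$, $\mathcal{H}_j = \big\{ \sum_{i=1}^{l} \alpha_i K_j(\tilde{x}_i, \cdot) : l > 0,\ \tilde{x}_i \in \mathcal{X},\ \alpha \in \mathbb{R}^l \big\}$ equipped with the inner product

$$\Big\langle \sum_{i=1}^{l} \alpha_i K_j(\tilde{x}_i, \cdot),\ \sum_{i'=1}^{m} \alpha'_{i'} K_j(\tilde{x}'_{i'}, \cdot) \Big\rangle_{\mathcal{H}_j} = \sum_{i, i'} \alpha_i\, \alpha'_{i'}\, K_j(\tilde{x}_i, \tilde{x}'_{i'}) .$$

Consider the space $\mathcal{H} = \mathcal{H}_1 \times \ldots \times \mathcal{H}_k$ equipped with the inner product $\langle u, v \rangle := \sum_{i=1}^{k} \langle u_i, v_i \rangle_{\mathcal{H}_i}$. For $w \in \mathcal{H}$, let $\|\cdot\|_{2,1}$ be the norm defined by $\|w\|_{2,1} = \sum_{i=1}^{k} \|w_i\|_{\mathcal{H}_i}$. It is easy to verify that $\mathcal{F}_{\mathcal{K}_c^+} \subseteq \mathcal{F}_\gamma$ where $\mathcal{F}_\gamma := \{ x \mapsto \langle w, \phi(x) \rangle : w \in \mathcal{H},\ \|w\|_{2,1} \le 1/\gamma \}$, and $\phi(x) = (K_1(x, \cdot), \ldots, K_k(x, \cdot)) \in \mathcal{H}$. Since $\|K_j(x, \cdot)\|_{\mathcal{H}_j} \le \sqrt{B}$, we also have $\|\phi(x)\|_{2,s} \le k^{1/s} \sqrt{B}$ for any $x \in \mathcal{X}$. The claim now follows directly from the results we derived in Section 2.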



